5  Programming & Advanced Features

Stata features the ability to create user-written commands. These can range from simple data manipulation commands to completely new statistical models. This is an advanced feature that not many users will need.

However, there are several components of the programming capabilities which are very useful even without writing your own commands. Here we’ll discuss several.

Let’s open a fresh version of “auto”:

. sysuse auto, clear
(1978 automobile data)

5.1 Macros

While variables stored as strings aren’t of much use to us, strings stored as other strings can be quite useful. Imagine the following scenario: You have a collection of 5 variables that you want to perform several different operations on. You might have code like this:

list var1 var2 var3 var4 var5 in 1/5
summarize var1 var2 var3 var4 var5
label define mylab 0 "No" 1 "Yes"
label values var1 var2 var3 var4 var5 mylab
duplicates list var1 var2 var3 var4 var5

This can get extremely tedious as the number of variables and commands increases. You could copy and paste a lot, but even that takes a lot of effort.

Instead, we can store the list of variables (strictly speaking, the string which contains the list of variables) in a shorter key string, and refer to that instead!

local vars = "var1 var2 var3 var4 var5"
list `vars' in 1/5
summarize `vars'
label define mylab 0 "No" 1 "Yes"
label values `vars' mylab
duplicates list `vars'

The first command, local, defines what is known as a “local macro”. Whenever it is referred to, wrapped in a backtick (to the left of the 1 key at the top-left of the keyboard) and a single quote, Stata replaces it with the original text. So when you enter

list `vars' in 1/5

Stata immediately replaces `vars' with var1 var2 var3 var4 var5, then executes

list var1 var2 var3 var4 var5 in 1/5

Important: Local macros are deleted as soon as the code that defined them finishes executing! In a do-file, that means you must run all the lines which create and access the macro at the same time, by highlighting them all.
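
A minimal sketch of why this matters, assuming the auto data and a hypothetical macro name sizevars. If the summarize line is run later on its own, the macro no longer exists, so it expands to nothing and Stata raises no error; summarize simply reports on every variable.

```stata
* Run these three lines together, e.g. by highlighting them in the do-file
sysuse auto, clear
local sizevars = "weight length trunk"
summarize `sizevars'
```

Running only the last line in a separate run is equivalent to typing summarize with no variable list.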

If your macro contains text that should be quoted, you still need to quote it when accessing. For example, if you had

label variable price1 "Price (in dollars) at Time Point 1"
label variable price2 "Price (in dollars) at Time Point 2"

you could instead write

local pricelab = "Price (in dollars) at Time Point"
label variable price1 "`pricelab' 1"
label variable price2 "`pricelab' 2"

You can use display to print the content of macros to the output to preview them.

. local test = "abc"

. display "`test'"
abc

You may occasionally see code that excludes the = in defining a macro (e.g. local vars "var1 var2"). This matters when working with numeric macros. Using an = forces Stata to evaluate the expression before storing it; excluding it stores the text exactly as typed. For example,

. local x 1 + 3

. display "`x'"
1 + 3

. local y = 1 + 3

. display "`y'"
4

Note the use of quotation marks there. display would evaluate a numeric expression whether or not the = was used, so by treating the macro’s contents as a string, we can see the difference.

5.1.1 Global macros

local defines a “local macro”; global defines a “global macro”. Global macros persist between runs - whereas a local macro is removed after the code finishes executing, the global stays around.

Global macros are accessed with $:

. global dog "Black Lab"

. display "$dog"
Black Lab
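
As a small sketch (the sentence being displayed is made up for illustration): when a global is followed immediately by other characters, wrapping its name in braces marks where the name ends, and macro drop removes a global once it is no longer needed.

```stata
global dog "Black Lab"
* Braces separate the macro name from the text that follows it
display "${dog}s are friendly"
* Remove the global; afterwards it expands to nothing
macro drop dog
display "$dog"
```

Because globals stay around until they are dropped or Stata is closed, dropping them explicitly helps avoid a stale value leaking into later code.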

5.1.2 Class and Return

Every command in Stata is of a particular type. One major aspect of the type is what the command “returns”. Some commands are n-class, which means they don’t return anything. Some are c-class, which are only used by programmers and rarely useful elsewhere. The two common ones are e-class and r-class. The distinction between the two is inconsequential, besides that they store their “returns” in different places.

Here, summarize is an r-class command, so it stores its returns in “return”. We can see them all with return list. On the other hand, mean (which we haven’t discussed, but basically displays summary statistics similar to summarize but provides some additional functionality) is an e-class command, storing its results in ereturn:

. summarize price

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       price |         74    6165.257    2949.496       3291      15906

. return list

scalars:
                  r(N) =  74
              r(sum_w) =  74
               r(mean) =  6165.256756756757
                r(Var) =  8699525.974268788
                 r(sd) =  2949.495884768919
                r(min) =  3291
                r(max) =  15906
                r(sum) =  456229

. mean price

Mean estimation                             Number of obs = 74

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
       price |   6165.257   342.8719      5481.914      6848.6
--------------------------------------------------------------

. ereturn list

scalars:
               e(df_r) =  73
             e(N_over) =  1
                  e(N) =  74
               e(k_eq) =  1
               e(rank) =  1

macros:
            e(cmdline) : "mean price"
                e(cmd) : "mean"
                e(vce) : "analytic"
              e(title) : "Mean estimation"
          e(estat_cmd) : "estat_vce_only"
            e(varlist) : "price"
       e(marginsnotok) : "_ALL"
         e(properties) : "b V"

matrices:
                  e(b) :  1 x 1
                  e(V) :  1 x 1
                 e(sd) :  1 x 1
                 e(_N) :  1 x 1
              e(error) :  1 x 1

functions:
             e(sample)   

Rather than try and keep track of what gets stored where, if you look at the very bottom of any help file, it will say something like “summarize stores the following in r():” or “mean stores the following in e():”, corresponding to return and ereturn respectively.
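
Stored results can also be used directly in expressions. The sketch below assumes the auto data is still loaded; it reads r() scalars after summarize, and an e() scalar, macro, and matrix after mean.

```stata
* r-class: results stay in r() until the next r-class command runs
summarize price
display r(max) - r(min)

* e-class: scalars, macros, and matrices are read from e()
mean price
display e(N)
display "`e(cmd)'"
matrix list e(b)
```

Note that e(cmd) is a macro, so it is wrapped in a backtick and a single quote like any other local macro reference.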

Along with the One Data principle, Stata also follows the One _-class principle - meaning you can only view the return or ereturn for the most recent command of that class. So if you run a summarize command, then do a bunch of n-class calls (gsort for example), the return list call will still give you the returns for that first summarize. However, as soon as you run another r-class command, you lose access to the first one. You can save any piece of it using a macro. For example, to calculate the average difference in price between foreign and domestic cars¹:

. summarize price if foreign == 1

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       price |         22    6384.682    2621.915       3748      12990

. local fprice = r(mean)

. summarize price if foreign == 0

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       price |         52    6072.423    3097.104       3291      15906

. local dprice = r(mean)

. display `dprice' - `fprice'
-312.25874

5.2 Variable Lists

Introduced in Stata 16, variable lists formalize a technique common in previous versions of Stata: defining a global containing a list of variables to be used later in the document. For example, you might see something like this at the top of a do-file:

global predictors x1 x2 x3 x4

then further down the document something like

regress y $predictors
logit z $predictors

Stata has formalized this concept with the addition of the vl command (variable list). It works similarly to the use of globals: lists of variables are defined, then later referenced via the $name syntax. However, using vl has the benefits of improved organization, customizations unique to variable lists, error checking, and overall convenience.

5.2.1 Initialization of Variable Lists

To begin using variable lists, vl set must be run.

. sysuse auto
(1978 automobile data)

. vl set

-------------------------------------------------------------------------------
                  |                      Macro's contents
                  |------------------------------------------------------------
Macro             |  # Vars   Description
------------------+------------------------------------------------------------
System            |
  $vlcategorical  |       2   categorical variables
  $vlcontinuous   |       2   continuous variables
  $vluncertain    |       7   perhaps continuous, perhaps categorical variables
  $vlother        |       0   all missing or constant variables
-------------------------------------------------------------------------------
Notes

      1. Review contents of vlcategorical and vlcontinuous to ensure they are
         correct.  Type vl list vlcategorical and type vl list vlcontinuous.

      2. If there are any variables in vluncertain, you can reallocate them
         to vlcategorical, vlcontinuous, or vlother.  Type
         vl list vluncertain.

      3. Use vl move to move variables among classifications.  For example,
         type vl move (x50 x80) vlcontinuous to move variables x50 and x80 to
         the continuous classification.

      4. vlnames are global macros.  Type the vlname without the leading
         dollar sign ($) when using vl commands.  Example: vlcategorical not
         $vlcategorical.  Type the dollar sign with other Stata commands to
         get a varlist.

This produces a surprisingly large amount of output. When you initialize the use of variable lists, Stata will automatically create four variable lists, called the “System variable lists”. Every numeric variable in the current data set is automatically placed into one of these four lists:

  • vlcategorical: Variables which Stata thinks are categorical. These generally have to be non-negative, integer valued variables with less than 10 unique values.
  • vlcontinuous: Variables which Stata thinks are continuous. These generally are variables which have negative values, have non-integer values, or are non-negative integers with more than 100 unique values.
  • vluncertain: Variables which Stata is unsure whether they are continuous or categorical. These generally are non-negative integer valued variables with between 10 and 100 unique values.
  • vlother: Any numeric variables that aren’t really useful - either all missing or constant variables.

There is a potential fifth system variable list, vldummy, which is created when option dummy is passed. Unsurprisingly, this will take variables containing only values 0 and 1 out of vlcategorical and into this list.

The “Notes” given below the output are generic; they appear regardless of how well Stata was able to categorize the variables. They can be suppressed with the nonotes option to vl set.

The two thresholds given above, 10 and 100, can be adjusted by the categorical and uncertain options. For example,

vl set, categorical(20) uncertain(50)

Running vl set on an already vl-set data set will result in an error, unless the clear option is given, which will re-generate the lists.

. vl set, dummy nonotes
one or more already classified variables specified
    You requested that variables be added to vl's system classifications, but
    you specified 11 variables that were already classified.
r(110);

. vl set, dummy nonotes clear

-------------------------------------------------------------------------------
                  |                      Macro's contents
                  |------------------------------------------------------------
Macro             |  # Vars   Description
------------------+------------------------------------------------------------
System            |
  $vldummy        |       1   0/1 variable
  $vlcategorical  |       1   categorical variable
  $vlcontinuous   |       2   continuous variables
  $vluncertain    |       7   perhaps continuous, perhaps categorical variables
  $vlother        |       0   all missing or constant variables
-------------------------------------------------------------------------------

In the above, we changed our minds and wanted to include the vldummy list, but since we’d already vl-set, we had to clear the existing set.

5.2.2 Viewing lists

When initializing the variable lists, we’re treated to a nice table of all defined lists. We can replay it via

. vl dir

-------------------------------------------------------------------------------
                  |                      Macro's contents
                  |------------------------------------------------------------
Macro             |  # Vars   Description
------------------+------------------------------------------------------------
System            |
  $vldummy        |       1   0/1 variable
  $vlcategorical  |       1   categorical variable
  $vlcontinuous   |       2   continuous variables
  $vluncertain    |       7   perhaps continuous, perhaps categorical variables
  $vlother        |       0   all missing or constant variables
-------------------------------------------------------------------------------

To see the actual contents of the variable lists, we’ll need to use vl list.

. vl list

----------------------------------------------------
    Variable | Macro           Values         Levels
-------------+--------------------------------------
     foreign | $vldummy        0 and 1             2
       rep78 | $vlcategorical  integers >=0        5
    headroom | $vlcontinuous   noninteger           
  gear_ratio | $vlcontinuous   noninteger           
       price | $vluncertain    integers >=0       74
         mpg | $vluncertain    integers >=0       21
       trunk | $vluncertain    integers >=0       18
      weight | $vluncertain    integers >=0       64
      length | $vluncertain    integers >=0       47
        turn | $vluncertain    integers >=0       18
displacement | $vluncertain    integers >=0       31
----------------------------------------------------

This output produces one row for each variable in each variable list it is in. We haven’t used this yet, but variables can be in multiple lists.

We can list only specific lists:

. vl list vlcategorical

------------------------------------------------
Variable | Macro           Values         Levels
---------+--------------------------------------
   rep78 | $vlcategorical  integers >=0        5
------------------------------------------------

or specific variables

. vl list (turn weight)

------------------------------------------------
Variable | Macro           Values         Levels
---------+--------------------------------------
    turn | $vluncertain    integers >=0       18
  weight | $vluncertain    integers >=0       64
------------------------------------------------

If turn was in multiple variable lists, each would appear as a row in this output.

There’s a bit of odd notation which can be used to sort the output by variable name, which makes it easier to identify variables which appear in multiple lists.

. vl list (_all), sort

----------------------------------------------------
    Variable | Macro           Values         Levels
-------------+--------------------------------------
displacement | $vluncertain    integers >=0       31
     foreign | $vldummy        0 and 1             2
  gear_ratio | $vlcontinuous   noninteger           
    headroom | $vlcontinuous   noninteger           
      length | $vluncertain    integers >=0       47
         mpg | $vluncertain    integers >=0       21
       price | $vluncertain    integers >=0       74
       rep78 | $vlcategorical  integers >=0        5
       trunk | $vluncertain    integers >=0       18
        turn | $vluncertain    integers >=0       18
      weight | $vluncertain    integers >=0       64
----------------------------------------------------

The (_all) tells Stata to report on all variables, and sorting (when you specify at least one variable) orders by variable name rather than variable list name.

This will also list any numeric variables which are not found in any list.

5.2.2.1 Moving variables in system lists

After initializing the variable lists, if you plan on using the system lists, you may need to move variables around (e.g. classifying the vluncertain variables into their proper lists). This can be done via vl move which has the syntax

vl move (<variables to move>) <destination list>

For example, all the variables in vluncertain are actually continuous:

. vl list vluncertain

----------------------------------------------------
    Variable | Macro           Values         Levels
-------------+--------------------------------------
       price | $vluncertain    integers >=0       74
         mpg | $vluncertain    integers >=0       21
       trunk | $vluncertain    integers >=0       18
      weight | $vluncertain    integers >=0       64
      length | $vluncertain    integers >=0       47
        turn | $vluncertain    integers >=0       18
displacement | $vluncertain    integers >=0       31
----------------------------------------------------

. vl move (price mpg trunk weight length turn displacement) vlcontinuous
note: 7 variables specified and 7 variables moved.

------------------------------
Macro          # Added/Removed
------------------------------
$vldummy                     0
$vlcategorical               0
$vlcontinuous                7
$vluncertain                -7
$vlother                     0
------------------------------

. vl dir

-------------------------------------------------------------------------------
                  |                      Macro's contents
                  |------------------------------------------------------------
Macro             |  # Vars   Description
------------------+------------------------------------------------------------
System            |
  $vldummy        |       1   0/1 variable
  $vlcategorical  |       1   categorical variable
  $vlcontinuous   |       9   continuous variables
  $vluncertain    |       0   perhaps continuous, perhaps categorical variables
  $vlother        |       0   all missing or constant variables
-------------------------------------------------------------------------------

Alternatively, since we’re moving all variables in vluncertain, we can see our first use of the variable list!

. vl set, dummy nonotes clear

-------------------------------------------------------------------------------
                  |                      Macro's contents
                  |------------------------------------------------------------
Macro             |  # Vars   Description
------------------+------------------------------------------------------------
System            |
  $vldummy        |       1   0/1 variable
  $vlcategorical  |       1   categorical variable
  $vlcontinuous   |       2   continuous variables
  $vluncertain    |       7   perhaps continuous, perhaps categorical variables
  $vlother        |       0   all missing or constant variables
-------------------------------------------------------------------------------

. vl move ($vluncertain) vlcontinuous
note: 7 variables specified and 7 variables moved.

------------------------------
Macro          # Added/Removed
------------------------------
$vldummy                     0
$vlcategorical               0
$vlcontinuous                7
$vluncertain                -7
$vlother                     0
------------------------------

Note that variable lists are essentially just global macros, so they can be referred to via $name. However, the $ is only used when we want to actually use the variable list as a macro - in this case, we wanted to expand vluncertain into its list of variables. When we’re referring to a variable list in the vl commands, we do not use the $.
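
To make the distinction concrete, here is a short sketch that continues the session above (it assumes vl set has already been run on the auto data):

```stata
* vl commands take the list name without the dollar sign
vl list vlcontinuous

* Other commands need the $ so the macro expands to the variable names
summarize $vlcontinuous
```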

5.2.3 User Variable Lists

In addition to the System variable lists, you can define your own User variable lists, which I imagine will be used far more often. These are easy to create with vl create:

. vl create mylist1 = (weight mpg)
note: $mylist1 initialized with 2 variables.

. vl create mylist2 = (weight length trunk)
note: $mylist2 initialized with 3 variables.

. vl dir, user

-------------------------------------------------------------------------------
                  |                      Macro's contents
                  |------------------------------------------------------------
Macro             |  # Vars   Description
------------------+------------------------------------------------------------
User              |
  $mylist1        |       2   variables
  $mylist2        |       3   variables
-------------------------------------------------------------------------------

. vl list, user

------------------------------------------------
Variable | Macro           Values         Levels
---------+--------------------------------------
  weight | $mylist1        integers >=0       64
     mpg | $mylist1        integers >=0       21
  weight | $mylist2        integers >=0       64
  length | $mylist2        integers >=0       47
   trunk | $mylist2        integers >=0       18
------------------------------------------------

Note the addition of the user option to vl list and vl dir to show only User variable lists and suppress the System variable lists. We can also demonstrate the odd sorting syntax here:

. vl list (_all), sort user

----------------------------------------------------
    Variable | Macro           Values         Levels
-------------+--------------------------------------
displacement | not in vluser                      31
     foreign | not in vluser                       2
  gear_ratio | not in vluser                        
    headroom | not in vluser                        
      length | $mylist2        integers >=0       47
         mpg | $mylist1        integers >=0       21
       price | not in vluser                      74
       rep78 | not in vluser                       5
       trunk | $mylist2        integers >=0       18
        turn | not in vluser                      18
      weight | $mylist1        integers >=0       64
      weight | $mylist2        integers >=0       64
----------------------------------------------------

You can refer to variable lists in all the usual shortcut ways:

vl create mylist = (q x1-x100 z*)
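
The variables in that line are hypothetical. On the auto data, the same shortcuts might look like the sketch below; the list names rangevars and tvars are made up, and the range shortcut depends on the order of the variables in the data set.

```stata
* weight-turn picks up weight, length, and turn (adjacent in auto)
vl create rangevars = (weight-turn)
* t* matches trunk and turn
vl create tvars = (t*)
vl list, user
* Drop the illustrative lists so later output is unchanged
vl drop rangevars tvars
```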

We can add labels to variable lists:

. vl label mylist1 "Related to gas consumption"

. vl label mylist2 "Related to size"

. vl dir, user

-------------------------------------------------------------------------------
                  |                      Macro's contents
                  |------------------------------------------------------------
Macro             |  # Vars   Description
------------------+------------------------------------------------------------
User              |
  $mylist1        |       2   Related to gas consumption
  $mylist2        |       3   Related to size
-------------------------------------------------------------------------------

5.2.3.1 Modifying User Variable Lists

First, note that with User Variable Lists, the vl move command does not work. It only works with system variable lists.

We can create new user variable lists which build off old lists with vl create. To add a new variable:

. vl create mylist3 = mylist2 + (gear_ratio)
note: $mylist3 initialized with 4 variables.

. vl list, user

--------------------------------------------------
  Variable | Macro           Values         Levels
-----------+--------------------------------------
    weight | $mylist1        integers >=0       64
       mpg | $mylist1        integers >=0       21
    weight | $mylist2        integers >=0       64
    length | $mylist2        integers >=0       47
     trunk | $mylist2        integers >=0       18
    weight | $mylist3        integers >=0       64
    length | $mylist3        integers >=0       47
     trunk | $mylist3        integers >=0       18
gear_ratio | $mylist3        noninteger           
--------------------------------------------------

Removing variables works the same way, with - in place of +. Here turn is not in mylist2, so nothing is actually removed and mylist4 ends up with the same three variables:

. vl create mylist4 = mylist2 - (turn)
note: $mylist4 initialized with 3 variables.

. vl list, user

--------------------------------------------------
  Variable | Macro           Values         Levels
-----------+--------------------------------------
    weight | $mylist1        integers >=0       64
       mpg | $mylist1        integers >=0       21
    weight | $mylist2        integers >=0       64
    length | $mylist2        integers >=0       47
     trunk | $mylist2        integers >=0       18
    weight | $mylist3        integers >=0       64
    length | $mylist3        integers >=0       47
     trunk | $mylist3        integers >=0       18
gear_ratio | $mylist3        noninteger           
    weight | $mylist4        integers >=0       64
    length | $mylist4        integers >=0       47
     trunk | $mylist4        integers >=0       18
--------------------------------------------------

Instead of adding (or removing) single variables at a time, we can instead add or remove lists. Keeping with the comment above, you do not use $ here to refer to the list.

. vl create mylist5 = mylist2 - mylist1
note: $mylist5 initialized with 2 variables.

. vl list mylist5

------------------------------------------------
Variable | Macro           Values         Levels
---------+--------------------------------------
  length | $mylist5        integers >=0       47
   trunk | $mylist5        integers >=0       18
------------------------------------------------

However, if we simply want to modify an existing list, a better approach is the vl modify command. vl create and vl modify are similar to generate and replace; the former creates a new variable list while the latter changes an existing variable list, but the syntax to the right of the = is the same.

. vl modify mylist3 = mylist3 + (headroom)
note: 1 variable added to $mylist3.

. vl modify mylist3 = mylist3 - (weight)
note: 1 variable removed from $mylist3.

5.2.4 Dropping variable lists

Variable lists can be dropped via vl drop:

. vl dir, user

-------------------------------------------------------------------------------
                  |                      Macro's contents
                  |------------------------------------------------------------
Macro             |  # Vars   Description
------------------+------------------------------------------------------------
User              |
  $mylist1        |       2   Related to gas consumption
  $mylist2        |       3   Related to size
  $mylist3        |       4   variables
  $mylist4        |       3   variables
  $mylist5        |       2   variables
-------------------------------------------------------------------------------

. vl drop mylist4 mylist5

. vl dir, user

-------------------------------------------------------------------------------
                  |                      Macro's contents
                  |------------------------------------------------------------
Macro             |  # Vars   Description
------------------+------------------------------------------------------------
User              |
  $mylist1        |       2   Related to gas consumption
  $mylist2        |       3   Related to size
  $mylist3        |       4   variables
-------------------------------------------------------------------------------

System lists cannot be dropped; if you run vl drop vlcontinuous it just removes all the variables from it.

5.2.5 Using Variable Lists

To be explicit, we can use variable lists in any command which would take the variables in that list. For example,

. describe $mylist3

Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
length          int     %8.0g                 Length (in.)
trunk           int     %8.0g                 Trunk space (cu. ft.)
gear_ratio      float   %6.2f                 Gear ratio
headroom        float   %6.1f                 Headroom (in.)

. describe $vlcategorical

Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
rep78           int     %8.0g                 Repair record 1978

We can also use them in a modeling setting.

. regress mpg $mylist3

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(4, 69)        =     30.77
       Model |  1565.65298         4  391.413244   Prob > F        =    0.0000
    Residual |  877.806484        69  12.7218331   R-squared       =    0.6408
-------------+----------------------------------   Adj R-squared   =    0.6199
       Total |  2443.45946        73  33.4720474   Root MSE        =    3.5668

------------------------------------------------------------------------------
         mpg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      length |  -.1837962   .0327629    -5.61   0.000    -.2491564   -.1184361
       trunk |  -.0103867   .1627025    -0.06   0.949    -.3349693    .3141959
  gear_ratio |   1.526952    1.27546     1.20   0.235    -1.017521    4.071426
    headroom |   .0136375   .6602514     0.02   0.984    -1.303528    1.330803
       _cons |   51.33708   8.300888     6.18   0.000     34.77727     67.8969
------------------------------------------------------------------------------

However, we’ll run into an issue here - how to specify categorical variables or interactions? The vl substitute command creates “factor-variable lists” that can include factor variable indicators (i.), continuous variable indicators (c.), and interactions (# or ##). (The name “factor-variable list” is slightly misleading; you could create a “factor-variable list” that includes no actual factors, for example, if you wanted to interact two continuous variables.)

Creating a factor-variable list via vl substitute can be done by specifying variables or variable lists.

. vl substitute sublist1 = mpg mylist3

. display "$sublist1"
mpg length trunk gear_ratio headroom

. vl dir

-------------------------------------------------------------------------------
                  |                      Macro's contents
                  |------------------------------------------------------------
Macro             |  # Vars   Description
------------------+------------------------------------------------------------
System            |
  $vldummy        |       1   0/1 variable
  $vlcategorical  |       1   categorical variable
  $vlcontinuous   |       9   continuous variables
  $vluncertain    |       0   perhaps continuous, perhaps categorical variables
  $vlother        |       0   all missing or constant variables
User              |
  $mylist1        |       2   Related to gas consumption
  $mylist2        |       3   Related to size
  $mylist3        |       4   variables
  $sublist1       |           factor-variable list
-------------------------------------------------------------------------------

Note the use of display "$listname" instead of vl list. Factor-variable lists are not just lists of variables; they can also include the features above, so they must be displayed instead. Note that in the vl dir output, “sublist1” has no number of variables listed, making it stand apart.

We can make this more interesting by actually including continuous/factor indicators and/or interactions.

. vl substitute sublist2 = c.mylist1##i.vldummy

. display "$sublist2"
weight mpg i.foreign i.foreign#c.weight i.foreign#c.mpg

Note the need to specify that mylist1 is continuous (with c.). It follows the normal convention that Stata assumes predictors in a model are continuous by default, unless they’re involved in an interaction, in which case it assumes they are factors by default.

. regress price $sublist2

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(5, 68)        =     16.82
       Model |   351163805         5  70232760.9   Prob > F        =    0.0000
    Residual |   283901591        68   4175023.4   R-squared       =    0.5530
-------------+----------------------------------   Adj R-squared   =    0.5201
       Total |   635065396        73  8699525.97   Root MSE        =    2043.3

------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      weight |   4.415037   .8529259     5.18   0.000      2.71305    6.117024
         mpg |    237.691   125.0383     1.90   0.062    -11.81907     487.201
             |
     foreign |
    Foreign  |   8219.603   7265.713     1.13   0.262    -6278.902    22718.11
             |
     foreign#|
    c.weight |
    Foreign  |   .7408054   1.647504     0.45   0.654    -2.546738    4.028348
             |
     foreign#|
       c.mpg |
    Foreign  |  -257.4683    155.426    -1.66   0.102     -567.616    52.67938
             |
       _cons |  -13285.44   5149.648    -2.58   0.012    -23561.41   -3009.481
------------------------------------------------------------------------------
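Because $sublist2 is an ordinary global macro, Stata substitutes its contents before running regress. As a minimal sketch, assuming mylist1 holds weight and mpg as displayed above, typing out the expansion by hand should fit the same model:

```stata
* Load the data only if you run this sketch on its own; if you are following
* along with vl lists already defined, skip this line to leave your session as-is.
sysuse auto, clear

* The expansion shown by display "$sublist2", written out explicitly
regress price weight mpg i.foreign i.foreign#c.weight i.foreign#c.mpg
```

The factor-variable list saves the typing and keeps the model specification in one place.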

5.2.5.1 Updating factor-variable Lists

Factor-variable lists cannot be directly modified.

. display "$sublist1"
mpg length trunk gear_ratio headroom

. vl modify sublist1 = sublist1 - mpg
sublist1 not allowed
    vlusernames containing factor variables not allowed in this context
    r(198);

However, if you create a factor-variable list using only other variable lists, then updating those lists and calling vl rebuild also updates the factor-variable list!

. vl create continuous = (turn trunk)
note: $continuous initialized with 2 variables.

. vl create categorical = (rep78 foreign)
note: $categorical initialized with 2 variables.

. vl substitute predictors = c.continuous##i.categorical

. display "$predictors"
turn trunk i.rep78 i.foreign i.rep78#c.turn i.foreign#c.turn i.rep78#c.trunk i.
> foreign#c.trunk

. vl modify continuous = continuous - (trunk)
note: 1 variable removed from $continuous.

. quiet vl rebuild

. display "$predictors"
turn i.rep78 i.foreign i.rep78#c.turn i.foreign#c.turn

Note the call to vl rebuild. Among other things, it will re-generate the factor-variable lists. (It produces a vl dir output without an option to suppress it, hence the use of quiet.)

5.2.6 Stored Statistics

You may have noticed that vl list reports certain characteristics of the variables.

. vl list mylist3

--------------------------------------------------
  Variable | Macro           Values         Levels
-----------+--------------------------------------
  headroom | $mylist3        noninteger           
     trunk | $mylist3        integers >=0       18
    length | $mylist3        integers >=0       47
gear_ratio | $mylist3        noninteger           
--------------------------------------------------

This reports some characteristics of the variables (integer, whether it’s non-negative) and the number of unique values. We can also see some other statistics:

. vl list mylist3, min max obs

-------------------------------------------------------------------------------
Variable | Macro           Values         Levels       Min       Max        Obs
---------+---------------------------------------------------------------------
headroom | $mylist3        noninteger                  1.5         5         74
   trunk | $mylist3        integers >=0       18         5        23         74
  length | $mylist3        integers >=0       47       142       233         74
gear_r~o | $mylist3        noninteger                 2.19      3.89         74
-------------------------------------------------------------------------------

This is similar to codebook except faster, because these characteristics are saved at the time the variable list is created or modified. The trade-off is that they are not updated automatically when the data changes.

. drop if weight < 3000
(35 observations deleted)

. summarize weight

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      weight |         39    3653.846    423.5788       3170       4840

. vl list (weight), min max obs

-------------------------------------------------------------------------------
Variable | Macro           Values         Levels       Min       Max        Obs
---------+---------------------------------------------------------------------
  weight | $vlcontinuous   integers >=0       64      1760      4840         74
  weight | $mylist1        integers >=0       64      1760      4840         74
  weight | $mylist2        integers >=0       64      1760      4840         74
-------------------------------------------------------------------------------

To re-generate these stored statistics, we call vl set again, with the update option.

. vl set, update

-------------------------------------------------------------------------------
                  |                      Macro's contents
                  |------------------------------------------------------------
Macro             |  # Vars   Description
------------------+------------------------------------------------------------
System            |
  $vldummy        |       1   0/1 variable
  $vlcategorical  |       1   categorical variable
  $vlcontinuous   |       9   continuous variables
  $vluncertain    |       0   perhaps continuous, perhaps categorical variables
  $vlother        |       0   all missing or constant variables
-------------------------------------------------------------------------------

. vl list (weight), min max obs

-------------------------------------------------------------------------------
Variable | Macro           Values         Levels       Min       Max        Obs
---------+---------------------------------------------------------------------
  weight | $vlcontinuous   integers >=0       34      3170      4840         39
  weight | $mylist1        integers >=0       34      3170      4840         39
  weight | $mylist2        integers >=0       34      3170      4840         39
-------------------------------------------------------------------------------

When the update option is passed, variable lists are not affected, only stored statistics are updated.

5.3 Linking data sets

In addition to allowing multiple data sets to be open at a time, we can link frames together such that rows of data in each frame are connected to each other and can inter-operate. This requires a linking variable in each data set which will connect the rows. The two data sets can be at the same level or at different levels.

For example, we might have data sets collected from multiple waves of surveys and follow-ups during which the same people (modulo some non-responses) are contained in each data set. Then the person ID variable in the data sets would be the linking variable.

Another example might be one file at the person level, and another file at the city level. The linking variable would be city name, which would be unique in the city file, but could potentially be repeated in the person level file.

The command to link files is frlink and requires specifying both the linking variable(s) and the frame to link to.

frlink 1:1 linkvar, frame(otherframe)

Let’s load some data from NHANES. Each file contains a row per subject.

. frame reset

. frame rename default demographics

. frame create diet

. frame create bp

. 
. import sasxport5 "https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT", cle
> ar

. frame diet: import sasxport5 "https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DR1T
> OT_I.XPT", clear

. frame bp: import sasxport5 "https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BPX_I.
> XPT", clear

. frame dir
* bp            9544 x 21
* demographics  9971 x 47
* diet          9544 x 168

Note: Frames marked with * contain unsaved data.

As you can see, the current frame is the “demographics” frame, and the other frames contain diet and blood pressure information. The variable seqn records person ID.

. frlink 1:1 seqn, frame(bp)
(427 observations in frame demographics unmatched)

. frlink 1:1 seqn, frame(diet)
(427 observations in frame demographics unmatched)

The 1:1 subcommand specifies that it is a 1-to-1 link - each person has no more than 1 row of data in each file. An alternative is m:1 which allows multiple rows in the main file to be linked to a single row in the second frame. 1:m is not allowed at this point in time.
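To illustrate an m:1 link, here is a minimal sketch using two tiny made-up frames: a person-level frame in which a city can repeat, and a city-level frame in which each city appears once. The frame name cities, the variable names, and all values are hypothetical. Note that frames reset clears every frame in memory, so run this in a fresh session rather than in the middle of the NHANES example.

```stata
* WARNING: removes all frames currently in memory
frames reset

* Person-level data in the default frame (made-up values)
input id str5 city
1 "Alpha"
2 "Alpha"
3 "Beta"
end

* City-level data in its own frame (made-up values)
frame create cities
frame change cities
input str5 city pop
"Alpha" 1000
"Beta" 500
end
frame change default

* Several people can match one city row, so the link is m:1
frlink m:1 city, frame(cities)

* Copy the city population onto each person's row
frget pop, from(cities)
list
```

Both people in “Alpha” receive the same pop value, which a 1:1 link would not allow.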

These commands created two new variables, bp and diet (the same names as the linked frames), which indicate which row of the linked frame is connected with the given row.

. list bp diet in 25/29

     +-----------+
     | bp   diet |
     |-----------|
 25. | 25     25 |
 26. | 26     26 |
 27. |  .      . |
 28. | 27     27 |
 29. | 28     28 |
     +-----------+

Here we see that row 27 in the demographics file was not found in either “bp” or “diet” and thus has no entry in the bp or diet variables.

Links are tracked by these variables. We can see the current status of a link via frlink describe:

. frlink describe diet

  History:
  -----------------------------------------------------------------------------
    Link variable diet created on 18 Aug 2023 by

    . frlink 1:1 seqn, frame(diet)

    Frame diet contained an unnamed dataset
  -----------------------------------------------------------------------------
  Verifying linkage ...
  Linkage is up to date.

We can see all links from the current frame via frlink dir:

. frlink dir
  (2 frlink variables found)
  -----------------------------------------------------------------------------
  bp created by frlink 1:1 seqn, frame(bp)
  -----------------------------------------------------------------------------
  diet created by frlink 1:1 seqn, frame(diet)
  -----------------------------------------------------------------------------
  Note: Type "frlink describe varname" to find out more, including whether
  the variable is still valid.

To unlink frames, simply drop the variable.

. drop diet

Finally, the names of the created variables can be modified via the generate option to frlink:

. frlink 1:1 seqn, frame(diet) generate(linkdiet)
(427 observations in frame demographics unmatched)

. frlink dir
  (2 frlink variables found)
  -----------------------------------------------------------------------------
  bp created by frlink 1:1 seqn, frame(bp)
  -----------------------------------------------------------------------------
  linkdiet created by frlink 1:1 seqn, frame(diet) generate(linkdiet)
  -----------------------------------------------------------------------------
  Note: Type "frlink describe varname" to find out more, including whether
  the variable is still valid.

5.3.1 Working with linked frames

Once we have linked frames, we can use variables in the linked frame in analyses on the main frame.

The frget command can copy variables from the linked frame into the primary frame.

. summarize bpxchr
variable bpxchr not found
r(111);

. frget bpxchr, from(bp)
(8,033 missing values generated)
(1 variable copied from linked frame)

. summarize bpxchr

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      bpxchr |      1,938    106.5614    21.75754         58        190

frget uses the 1:1 or m:1 link to associate each copied value with the right observation.

Alternatively, when using generate, we can reference a variable in another frame with the frval() function.

. generate nonsense = frval(linkdiet, dr1tcalc)/frval(bp, bpxpls) + dmdhrage
(3,158 missing values generated)

Note that this calculation used variables from all three frames. A less nonsensical example might be where we want the share of a state's population located in a given county. Imagine we have the primary frame of county data, and then a separate linked frame “state” containing state-level information.

generate percentpopulation = population/frval(state, population)
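The one-line example above assumes a link variable named state already exists. A minimal end-to-end sketch with made-up data follows; the variable names st and county and all population values are hypothetical, and frames reset clears every frame in memory, so try it in a fresh session.

```stata
* WARNING: removes all frames currently in memory
frames reset

* County-level data in the default frame (made-up values)
input str2 st str8 county population
"AA" "North" 300
"AA" "South" 700
"BB" "East"  500
end

* State-level data in a frame named state (made-up values)
frame create state
frame change state
input str2 st population
"AA" 1000
"BB" 500
end
frame change default

* Many counties link to one state row; this creates the link variable state
frlink m:1 st, frame(state)

* frval(linkvar, varname) reads population from the linked state frame
generate percentpopulation = population/frval(state, population)
list
```

Both frames contain a variable called population; the bare name refers to the current (county) frame, while frval() reaches into the linked frame.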

5.4 Loops

Using macros can simplify code if you have to use the same string repeatedly, but what if you want to perform the same command repeatedly with different variables? Here we can use a foreach loop. This is easiest to see with examples.

. sysuse pop2000, clear
(2000 U.S. Census population by age and sex)

The “pop2000” data contains data from the 2000 census, broken down by age and gender. The values are total counts; let’s say we instead want percentages by gender. For example, what percentage of Asians aged 25-29 are male? We could generate this manually.

. generate maletotalperc = maletotal/total

. generate femtotalperc = femtotal/total

. generate malewhiteperc = malewhite/white

This gets tedious fast as we need a total of 12 lines! Notice, however, that each line has a predictable pattern:

generate <gender><race>perc = <gender><race>/<race>

We can exploit this by creating a foreach loop over the racial categories and only needing a single command.

. drop *perc

. foreach race of varlist total-island {
  2.   generate male`race'perc = male`race'/`race'
  3.   generate fem`race'perc = fem`race'/`race'
  4. }

. list *perc in 1, abbreviate(100)

     +---------------------------------------------------------------+
  1. | maletotalperc | femtotalperc | malewhiteperc  | femwhiteperc  |
     |      .5116206 |     .4883794 |      .5130497  |     .4869503  |
     |---------------+--------------+----------------+---------------|
     | maleblackperc | femblackperc | maleindianperc | femindianperc |
     |      .5078017 |     .4921983 |       .5100116 |      .4899884 |
     |---------------+--------------+----------------+---------------|
     | maleasianperc | femasianperc | maleislandperc | femislandperc |
     |      .5029027 |     .4970973 |       .5167261 |      .4832739 |
     +---------------------------------------------------------------+

Let’s break down each piece of the command. The command syntax for foreach is

foreach <new macroname> of varlist <list of variables>

The loop will create a macro that you name (in the example above, it was named “race”), and repeatedly set it to each subsequent entry in the list of variables. So in the code above, first “race” is set to “total”, then the two generate commands are run. Next, “race” is set to “white”, then the two commands are run. Etc.

Within each of the generate commands, we use the backtick-quote notation just like with macros.

Finally, we end the foreach line with an open curly brace, {, and the line after the last command within the loop has the matching close curly brace, }.

We can also nest these loops. Notice that both generate statements above are identical except for “male” vs “fem”. Let’s put an internal loop:

. drop *perc

. foreach race of varlist total-island {
  2.   foreach gender in male fem {
  3.     generate `gender'`race'perc = `gender'`race'/`race'
  4.   }
  5. }

. list *perc in 1, ab(100)

     +---------------------------------------------------------------+
  1. | maletotalperc | femtotalperc | malewhiteperc  | femwhiteperc  |
     |      .5116206 |     .4883794 |      .5130497  |     .4869503  |
     |---------------+--------------+----------------+---------------|
     | maleblackperc | femblackperc | maleindianperc | femindianperc |
     |      .5078017 |     .4921983 |       .5100116 |      .4899884 |
     |---------------+--------------+----------------+---------------|
     | maleasianperc | femasianperc | maleislandperc | femislandperc |
     |      .5029027 |     .4970973 |       .5167261 |      .4832739 |
     +---------------------------------------------------------------+

Each time “race” gets set to a new variable, we enter another loop where “gender” gets set first to “male” then to “fem”. To help visualize it, here is what “race” and “gender” are set to each time the generate command is run:

generate command   “race”    “gender”
               1   total     male
               2   total     fem
               3   white     male
               4   white     fem
               5   black     male
               6   black     fem
               7   indian    male
               8   indian    fem
               9   asian     male
              10   asian     fem
              11   island    male
              12   island    fem

Notice the syntax of the above two foreach differs slightly:

foreach <macro name> of varlist <variables>
foreach <macro name> in <list of strings>

It’s a bit annoying, but Stata handles “of” and “in” slightly differently. The “in” form treats whatever is on the right as literal strings, without expanding anything. This means that if the above loop over race were

foreach race in total-island

then Stata would set “race” to “total-island” and the generate command would run once! By using “of varlist”, you are telling Stata that before it sets “race” to anything, expand the varlist using the rules such as * and -.
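A quick way to see the difference is to print the macro inside each loop. This is a small sketch using the same pop2000 data; only display is run, so no variables are created.

```stata
sysuse pop2000, clear

* "of varlist" expands total-island into the individual variable names
foreach race of varlist total-island {
  display "of varlist: `race'"
}

* "in" takes the text literally, so the loop body runs exactly once
foreach race in total-island {
  display "in: `race'"
}
```

The first loop prints one line per variable from total through island; the second prints the single string total-island.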

There is also

foreach <macro name> of numlist <list of numbers>

The benefit of of numlist is that numlists support things like 1/4 representing 1, 2, 3, 4. So

foreach num of numlist 1 3/5

loops over 1, 3, 4, 5, whereas

foreach num in 1 3/5

loops over just “1” and “3/5”.
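The two loop headers above can be completed into a runnable sketch by adding a body that simply displays the macro:

```stata
* numlist is expanded: 3/5 becomes 3 4 5
foreach num of numlist 1 3/5 {
  display "of numlist: `num'"
}

* no expansion: the elements are the strings "1" and "3/5"
foreach num in 1 3/5 {
  display "in: `num'"
}
```

The first loop runs four times and the second runs twice.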

The use of in is for when you need to loop over strings that are neither numbers nor variables (such as “male” and “fem” from above).

5.5 Suppressing output and errors

There are two useful command prefixes that can be handy while writing more elaborate Do-files.

5.5.1 Capturing an error

Imagine the following scenario. You want to write a Do-file that generates a new variable. However, you may need to re-run chunks of the Do-file repeatedly, so that the generate statement is hit repeatedly. After the first generate, we can’t call it again and need to use replace instead. However, if we used replace, it wouldn’t work the first time! One solution is to drop the variable before we generate it:

. sysuse auto, clear
(1978 automobile data)

. drop newvar
variable newvar not found
r(111);

. generate newvar = 1

That error is awfully annoying, and when it occurs inside a running do-file it stops execution. However, if we prefix the command with capture, the error (and all output from the command) is “captured” and hidden.

. list price in 1/5

     +-------+
     | price |
     |-------|
  1. | 4,099 |
  2. | 4,749 |
  3. | 3,799 |
  4. | 4,816 |
  5. | 7,827 |
     +-------+

. capture list price in 1/5

. list abcd
variable abcd not found
r(111);

. capture list abcd

Therefore, the best way to generate our new variable is

. capture drop newvar

. generate newvar = 1

5.5.1.1 Return Code

When you capture a command that errors, Stata saves the error code in the system variable _rc.

. list abc
variable abc not found
r(111);

. capture list abc

. display _rc
111

If the command does not error, _rc contains 0.

. capture list price

. display _rc
0

This can be used to run additional code if an error occurs:

capture <some command>
if _rc > 0 {
  ...
} else {
  ...
}

If the command inside the capture errors, _rc is greater than 0 and the if block will run. If the command runs without error, _rc is 0 and the else block will run.

Say you wanted to rename a variable if it exists, and if it doesn’t exist, create it. (For example, you have to process a large number of files, and in some files, this variable may be missing for all rows and thus not reported.) You could run the following:

capture rename oldvar newvar
if _rc > 0 {
  generate newvar = .
}
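Checking only _rc > 0 treats every error the same. As a hedged sketch of a stricter variant, we can tolerate just the “not found” code 111 seen above and re-raise anything else with the error command, so unexpected problems are not silently swallowed:

```stata
sysuse auto, clear

capture drop newvar
if _rc == 111 {
  * Expected on the first run: the variable did not exist yet
  display "newvar did not exist; nothing to drop"
}
else if _rc != 0 {
  * Any other failure is unexpected, so surface it
  error _rc
}

generate newvar = 1
```

This can be re-run repeatedly: on later runs the drop succeeds, _rc is 0, and neither branch executes.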

5.5.2 Quieting the output

quietly hides a command's output just as capture does, except that it does not hide errors. It is especially useful combined with stored results:

. quietly summarize price

. display r(mean)
6165.2568

This will come in very handy when you start running statistical models, where the output can be over a single screen, whereas you only want a small piece of it.
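As a small sketch of the same pattern, the stored result can feed directly into a later command. Here price is centered at its mean without printing the summary table; the variable name price_centered is just an illustrative choice.

```stata
sysuse auto, clear

* Run summarize silently; r(mean) is still stored
quietly summarize price

* Use the stored mean right away, before another command overwrites r()
generate double price_centered = price - r(mean)

summarize price_centered
```

The stored results in r() are replaced by the next command that stores its own r() results, so use them (or copy them into a macro) promptly.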

Just to make the difference between capture and quietly clear:

. list price in 1/5

     +-------+
     | price |
     |-------|
  1. | 4,099 |
  2. | 4,749 |
  3. | 3,799 |
  4. | 4,816 |
  5. | 7,827 |
     +-------+

. quietly list price in 1/5

. capture list price in 1/5

. list abcd in 1/5
variable abcd not found
r(111);

. quietly list abcd in 1/5
variable abcd not found
r(111);

. capture list abcd in 1/5

With a command that doesn’t error (listing price), both quietly and capture perform the same. However, with a command that does error, quietly still errors, whereas capture just ignores it!


  1. There are obviously other ways to compute this, but this gives a flavor of the use.