A place to store and share the code I wrote to analyze the data generated by a recent Amazon Mechanical Turk survey experiment. The analysis from this experiment was recently accepted for publication as "Income Disclosure and Consumer Judgment in a Multi-level Marketing Experiment" in the Journal of Consumer Affairs.
Austin M. Miller 7/29/2022
This document walks through the R script that I wrote to transform the data generated by a recent Amazon Mechanical Turk survey experiment in preparation for analysis. The goal of the analysis was to assess the impact of voluntary income disclosures in MLM marketing materials on consumer interest and earnings expectations. All participants were introduced to an MLM opportunity using marketing materials from the website of an actual MLM firm. The control group did not receive any income disclosure information; treatment group 1 received the income disclosure document created by the MLM firm itself; and treatment group 2 received an augmented form of the firm’s income disclosure information that included a graph and showed how many of the firm’s participants actually earned zero dollars. The analysis from this experiment was recently accepted for publication as “Income Disclosure and Consumer Judgment in a Multi-level Marketing Experiment” in the Journal of Consumer Affairs.
library(tidyverse)
library(stringr)
library(modelr)
library(broom)
library(lmtest)
library(sandwich)
library(stargazer)
library(readxl)
I don’t know if this script actually uses most of these packages; many of them are just included in most of my R scripts by default.
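If I ever wanted to know which of these packages the script actually calls, something like renv::dependencies() could scan the file for package usage. This is just a sketch, not part of the original script, and it assumes renv is installed and that the script is saved under the hypothetical file name "clean_mlm_data.R":
# Optional check (not part of the cleaning script): scan the script file for
# the packages it actually uses. The file name here is hypothetical.
renv::dependencies("clean_mlm_data.R")
On to the actual script: the first chunk reads in the raw Qualtrics export and flags each response's experimental group.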
rawdat <- read_excel("data/raw/MLMDisclosureData.xlsx") %>%
filter(StartDate>="2020-04-01") %>%
filter(Progress==100) %>%
.[,colSums(is.na(.))<nrow(.)] %>%
mutate(T1 = if_else(!is.na(Q69), 1, 0)) %>%
mutate(T2 = if_else(!is.na(Q150), 1, 0)) %>%
mutate(C = if_else(!is.na(Q20), 1, 0))
The *.xlsx data file from Qualtrics contains all responses to the survey, including practice responses that researchers created before the survey was officially launched. These lines filter out the practice responses and any incomplete responses, and they drop a few variables that Qualtrics generated with NA in every row. The mutate() commands flag each response as belonging to Treatment 1, Treatment 2, or the Control group.
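As a quick sanity check on those flags (a sketch, not in the original script), every response should land in exactly one of the three groups:
# Sketch of a consistency check: each row should be flagged as exactly one of
# Control, Treatment 1, or Treatment 2.
rawdat %>% count(C, T1, T2)
stopifnot(all(rawdat$C + rawdat$T1 + rawdat$T2 == 1))
Next, each group gets its own data frame, starting with Treatment 1.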
mlmT1 <- rawdat %>% filter(T1==1) %>%
select(-Q20,-Q25,-Q21,-Q22,-Q23,-Q27,-Q138) %>%
select(-Q150,-Q151,-Q164,-Q165,-Q166,-Q167,-Q137) %>%
rename(interest = Q69,
interfr = Q70,
earnings = Q94,
earnmost = Q95,
earnleast = Q96,
over6 = Q97,
earnbta = Q98,
) %>%
mutate(treatment = "Company Disclosure",
tml = if_else(EarningsT1_DO_Q94 < EarningsT1_DO_Q95 &
EarningsT1_DO_Q94 < EarningsT1_DO_Q96 &
EarningsT1_DO_Q95 < EarningsT1_DO_Q96, 1, 0),
tlm = if_else(EarningsT1_DO_Q94 < EarningsT1_DO_Q95 &
EarningsT1_DO_Q94 < EarningsT1_DO_Q96 &
EarningsT1_DO_Q96 < EarningsT1_DO_Q95, 1, 0),
mtl = if_else(EarningsT1_DO_Q95 < EarningsT1_DO_Q94 &
EarningsT1_DO_Q95 < EarningsT1_DO_Q96 &
EarningsT1_DO_Q94 < EarningsT1_DO_Q96, 1, 0),
mlt = if_else(EarningsT1_DO_Q95 < EarningsT1_DO_Q94 &
EarningsT1_DO_Q95 < EarningsT1_DO_Q96 &
EarningsT1_DO_Q96 < EarningsT1_DO_Q94, 1, 0),
ltm = if_else(EarningsT1_DO_Q96 < EarningsT1_DO_Q94 &
EarningsT1_DO_Q96 < EarningsT1_DO_Q95 &
EarningsT1_DO_Q94 < EarningsT1_DO_Q95, 1, 0),
lmt = if_else(EarningsT1_DO_Q96 < EarningsT1_DO_Q94 &
EarningsT1_DO_Q96 < EarningsT1_DO_Q95 &
EarningsT1_DO_Q95 < EarningsT1_DO_Q94, 1, 0)
)
This section removes the questions that are not relevant to this treatment group (Treatment 1, in this case) and renames all the dependent variables. The mutate() section here labels the treatment group and creates a marker for the order in which the earnings questions were asked. All participants were asked how much money they think they would earn in a typical year, as well as the most and the least they think they could earn. The order of these three questions was randomized across all participants. In the marker names, m is for most, l is for least, and t is for typical earnings.
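Since these six order dummies should be mutually exclusive, a quick check like the following (a sketch, not in the original script) confirms that each Treatment 1 response matches exactly one ordering:
# Sketch: each response should have exactly one of the six order dummies set.
mlmT1 %>%
  mutate(n_orders = tml + tlm + mtl + mlt + ltm + lmt) %>%
  count(n_orders)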
The following section does the same thing as above for each of the other treatment groups (Treatment 2 and the Control group).
mlmT2 <- rawdat %>% filter(T2==1) %>%
select(-Q20,-Q25,-Q21,-Q22,-Q23,-Q27,-Q138) %>%
select(-Q69,-Q70,-Q94,-Q95,-Q96,-Q97,-Q98) %>%
rename(interest = Q150,
interfr = Q151,
earnings = Q164,
earnmost = Q165,
earnleast = Q166,
over6 = Q167,
earnbta = Q137
)%>%
mutate(treatment = "Graphical Disclosure",
tml = if_else(EarningsT2_DO_Q164 < EarningsT2_DO_Q165 &
EarningsT2_DO_Q164 < EarningsT2_DO_Q166 &
EarningsT2_DO_Q165 < EarningsT2_DO_Q166, 1, 0),
tlm = if_else(EarningsT2_DO_Q164 < EarningsT2_DO_Q165 &
EarningsT2_DO_Q164 < EarningsT2_DO_Q166 &
EarningsT2_DO_Q166 < EarningsT2_DO_Q165, 1, 0),
mtl = if_else(EarningsT2_DO_Q165 < EarningsT2_DO_Q164 &
EarningsT2_DO_Q165 < EarningsT2_DO_Q166 &
EarningsT2_DO_Q164 < EarningsT2_DO_Q166, 1, 0),
mlt = if_else(EarningsT2_DO_Q165 < EarningsT2_DO_Q164 &
EarningsT2_DO_Q165 < EarningsT2_DO_Q166 &
EarningsT2_DO_Q166 < EarningsT2_DO_Q164, 1, 0),
ltm = if_else(EarningsT2_DO_Q166 < EarningsT2_DO_Q164 &
EarningsT2_DO_Q166 < EarningsT2_DO_Q165 &
EarningsT2_DO_Q164 < EarningsT2_DO_Q165, 1, 0),
lmt = if_else(EarningsT2_DO_Q166 < EarningsT2_DO_Q164 &
EarningsT2_DO_Q166 < EarningsT2_DO_Q165 &
EarningsT2_DO_Q165 < EarningsT2_DO_Q164, 1, 0)
)
mlmTC <- rawdat %>% filter(C==1) %>%
select(-Q69,-Q70,-Q94,-Q95,-Q96,-Q97,-Q98) %>%
select(-Q150,-Q151,-Q164,-Q165,-Q166,-Q167,-Q137) %>%
rename(interest = Q20,
interfr = Q25,
earnings = Q21,
earnmost = Q22,
earnleast = Q23,
over6 = Q27,
earnbta = Q138
)%>%
mutate(treatment = "No Disclosure",
tml = if_else(EarningsC_DO_Q21 < EarningsC_DO_Q22 &
EarningsC_DO_Q21 < EarningsC_DO_Q23 &
EarningsC_DO_Q22 < EarningsC_DO_Q23, 1, 0),
tlm = if_else(EarningsC_DO_Q21 < EarningsC_DO_Q22 &
EarningsC_DO_Q21 < EarningsC_DO_Q23 &
EarningsC_DO_Q23 < EarningsC_DO_Q22, 1, 0),
mtl = if_else(EarningsC_DO_Q22 < EarningsC_DO_Q21 &
EarningsC_DO_Q22 < EarningsC_DO_Q23 &
EarningsC_DO_Q21 < EarningsC_DO_Q23, 1, 0),
mlt = if_else(EarningsC_DO_Q22 < EarningsC_DO_Q21 &
EarningsC_DO_Q22 < EarningsC_DO_Q23 &
EarningsC_DO_Q23 < EarningsC_DO_Q21, 1, 0),
ltm = if_else(EarningsC_DO_Q23 < EarningsC_DO_Q21 &
EarningsC_DO_Q23 < EarningsC_DO_Q22 &
EarningsC_DO_Q21 < EarningsC_DO_Q22, 1, 0),
lmt = if_else(EarningsC_DO_Q23 < EarningsC_DO_Q21 &
EarningsC_DO_Q23 < EarningsC_DO_Q22 &
EarningsC_DO_Q22 < EarningsC_DO_Q21, 1, 0),
)
I’ll break up the following transformations into multiple pipe %>%
chains so that I can describe each section separately.
mlm <- rbind(mlmT1,mlmT2,mlmTC) %>%
mutate(treatment = as.factor(treatment) %>% fct_relevel("No Disclosure")
) %>%
mutate(earnorder = if_else(tml==1,"tml",
if_else(tlm==1,"tlm",
if_else(mtl==1,"mtl",
if_else(mlt==1,"mlt",
if_else(lmt==1,"lmt",
if_else(ltm==1,"ltm", NA_character_))))))
)
This first part binds the three treatment groups back into a single data frame and begins the transformations. First, I convert the treatment variable to a factor and set the Control group as the reference group. This changes the automatic behavior of the regressions when treatment is included as an independent variable: indicator variables are created for each treatment group, with the Control group left as the omitted category.
I also create a composite earnorder variable to act as a label for the order in which the earnings questions were asked (most, least, and typical earnings).
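The nested if_else() calls above get the job done; an equivalent way to build earnorder would be case_when(), shown here only as an alternative sketch (not part of the original script):
# Sketch of an equivalent earnorder construction using case_when().
mlm %>%
  mutate(earnorder = case_when(
    tml == 1 ~ "tml",
    tlm == 1 ~ "tlm",
    mtl == 1 ~ "mtl",
    mlt == 1 ~ "mlt",
    lmt == 1 ~ "lmt",
    ltm == 1 ~ "ltm",
    TRUE     ~ NA_character_
  ))
The next chunk turns to the dependent variables themselves.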
mlm <- mlm %>%
mutate(interestg = interest,
interest = as.factor(interest) %>%
fct_recode("1" = "1- No Interest At All", "7" = "7- Extremely Interested") %>%
as.numeric(),
interestover1 = interest>1,
over6 = recode(over6, "0% (No Chance)"=0, "1-9% Chance"=0.05,
"10-19% Chance"=0.15, "20-29% Chance"=0.25, "30-39% Chance"=0.35,
"40-49% Chance"=0.45, "50-59% Chance"=0.55, "60-69% Chance"=0.65,
"70-79% Chance"=0.75, "80-89% Chance"=0.85, "90-99% Chance"=0.95,
"100% Chance (Certain)"=1
),
expenses = str_replace(Q28, "3,000-5,000", "4000") %>%
str_replace("100.00", "100") %>%
str_replace_all("[$.,?A-Za-z ]", "") %>%
as.numeric(),
earnhigh = if_else(earnings>1000,1,0),
earnfirst = !if_else(FL_17_DO_FL_21==1 | FL_17_DO_FL_25==1 | FL_17_DO_FL_22==1,
1, 0, missing = 0
)
)
This part transforms/creates several variables related to the dependent variables of the analysis:
- interest, over6, and expenses were originally recorded as strings, so they are converted to numeric variables here (a quick check follows this list).
- I keep a copy of interest named interestg that maintains the labels "1- No Interest At All" and "7- Extremely Interested", which I will use for creating the graphs of the distribution of interest.
- expenses was collected as a free-text response, and you can see a couple of ad-hoc replacements in the code. Next time I will restrict responses to be only numeric values, like I did for the earnings questions.
- interestover1 is an alternative way of looking at interest as a dependent variable. As we will see, most respondents in this experiment reported no interest in the presented MLM opportunity, so whether income disclosures have any effect on whether a person has any interest at all becomes an interesting question.
- earnhigh and earnfirst are indicator variables that are used in parts of the analysis.
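A quick tabulation (a sketch, not part of the original script) is an easy way to confirm that the string-to-numeric recode of interest kept the intended 1-7 ordering:
# Sketch: the labeled copy (interestg) and the numeric version (interest)
# should line up one-to-one.
mlm %>% count(interestg, interest)
The next chunk moves on to the demographic and background variables.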
mlm <- mlm %>%
mutate(gender = Q29,
woman = Q29=="Woman",
man = Q29=="Man",
age = Q30,
education = as.factor(Q33) %>%
fct_collapse("High school or less" = c("Some high school", "High school graduate"),
"College Graduate/Vocational Training" = c("Trade/technical/vocational training",
"College graduate",
"Some postgraduate education")
) %>%
fct_relevel("High school or less",
"Some college",
"College Graduate/Vocational Training",
"Post graduate degree"
),
educ_hs = education=="High school or less",
educ_sc = education=="Some college",
educ_cg = education=="College Graduate/Vocational Training",
educ_pg = education=="Post graduate degree",
income = as.factor(Q34) %>%
fct_relevel("Less than $24,999", "$25,000 to $49,999",
"$50,000 to 99,999", "$100,000 or more"
),
inc_0 = income=="Less than $24,999",
inc_25 = income=="$25,000 to $49,999",
inc_50 = income=="$50,000 to 99,999",
inc_100 = income=="$100,000 or more",
black = if_else(is.na(Q31_2), 0, 1),
white = if_else(is.na(Q31_4), 0, 1),
hispanic = Q32=="Hispanic or Latino",
other = if_else(is.na(Q31_5) & is.na(Q31_3) & is.na(Q31_1), 0, 1),
religious = Q35,
rel_nr = religious=="Not at all religious",
rel_lr = religious=="Not too religious",
rel_sr = religious=="Somewhat religious",
rel_vr = religious=="Very religious",
relig = !rel_nr,
knownMLM = Q43=="Yes",
knowMLMstill = Q136=="Yes",
wasMLM = Q44=="Yes",
nowMLM = Q127=="Yes",
MLMrecruited = Q46=="Yes")
Nothing too tricky—just a bunch of indicator variables and factors (with a bit of re-factoring here and there).
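One thing worth a quick look (again, just a sketch rather than part of the original script) is whether the collapsed and releveled factors came out with the intended level order, since that order is what later regressions and tables will use:
# Sketch: inspect the resulting level order of the collapsed factors.
levels(mlm$education)
levels(mlm$income)
The next chunk builds the knowledge scores and recodes earnbta.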
mlm <- mlm %>%
mutate(numeracy = if_else(Q36 == "More than 10 people", 1, 0, missing = 0) +
if_else(Q37 == "Exactly $15,000", 1, 0, missing = 0) +
if_else(Q38 == "More than $240", 1, 0, missing = 0),
finance = if_else(Q39 == "More than $102", 1, 0, missing = 0) +
if_else(Q40 == "Less than what you can buy today with the money in this account", 1, 0, missing = 0) +
if_else(Q41 == "False", 1, 0, missing = 0),
EVtest = if_else(Q42 == "More than $50", 1, 0, missing = 0),
knowledge = numeracy + finance + EVtest
) %>%
mutate(earnbta = ifelse(earnbta=="1- I Think I Would Earn Much Less Than Average Participant", 1, earnbta),
earnbta = ifelse(earnbta=="4- I Think I Would Earn As Much As Average Participant", 4, earnbta),
earnbta = ifelse(earnbta=="7- I Think I Would Earn Much More Than Average Participant", 7, earnbta) %>%
as.numeric(),
earnbta4 = earnbta>4
)
Things get a little more interesting here. numeracy, finance, and EVtest are scores generated by adding up the number of correct answers from a series of test questions. Then I iteratively mutated earnbta instead of (1) converting it to a factor, (2) re-factoring, and then (3) converting it to a numeric variable (like I did with interest in the Dependent Variables section). I don’t know which way I like better. The fact that I had to use ifelse() instead of if_else() in this one section (because if_else() requires both possible outcomes to be of the same type) probably means I should have gone the re-factoring route. What do you think?
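For what it’s worth, the re-factoring route would look something like the sketch below. It is not part of the original script and would replace the ifelse() chain rather than run after it, so it assumes earnbta still holds the original string responses:
# Sketch of the factor-based alternative to the ifelse() chain above.
# Assumes earnbta still contains the original string labels.
mlm %>%
  mutate(earnbta = as.factor(earnbta) %>%
           fct_recode("1" = "1- I Think I Would Earn Much Less Than Average Participant",
                      "4" = "4- I Think I Would Earn As Much As Average Participant",
                      "7" = "7- I Think I Would Earn Much More Than Average Participant") %>%
           as.character() %>%
           as.numeric(),
         earnbta4 = earnbta > 4)
The next chunk builds the risk score from eight separate questions.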
cols <- c("Q47_1","Q47_2","Q47_3","Q47_4","Q47_5","Q47_6","Q47_7","Q47_8")
mlm[cols] <- lapply(mlm[cols], as.factor)
mlm[cols] <- map(mlm[cols], ~fct_recode(.,"1" = "Extremely Unlikely",
"2" = "Moderately Unlikely",
"3" = "Slightly Unlikely",
"4" = "Not Sure",
"5"= "Slightly Likely",
"6" = "Moderately Likely",
"7" = "Extremely Likely"
)
)
mlm[cols] <- lapply(mlm[cols], as.numeric)
mlm <- mlm %>%
mutate(risk = Q47_1 + Q47_2 + Q47_3 + Q47_4 + Q47_5 + Q47_6 + Q47_7 + Q47_8)
Here, I created the risk variable in multiple steps. I had eight different variables that all had to be re-coded the same way, so to avoid repetition I just applied the same transformations to all eight columns at once. I could probably rewrite this to match the style of the rest of the document (which relies more heavily on pipes, %>%), but it works just fine.
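If I did rewrite it, across() would let the whole thing sit in a single mutate() call. The version below is only a sketch of how the block above could have been written (it is not part of the script, and it assumes the Q47 columns still hold their original string labels):
# Sketch: the same recode folded into a single pipe with across().
mlm %>%
  mutate(across(all_of(cols),
                ~ as.factor(.x) %>%
                  fct_recode("1" = "Extremely Unlikely", "2" = "Moderately Unlikely",
                             "3" = "Slightly Unlikely", "4" = "Not Sure",
                             "5" = "Slightly Likely", "6" = "Moderately Likely",
                             "7" = "Extremely Likely") %>%
                  as.numeric()),
         risk = Q47_1 + Q47_2 + Q47_3 + Q47_4 + Q47_5 + Q47_6 + Q47_7 + Q47_8)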
Before dropping any observations, I want to see if there are any patterns with respect to survey responses that have inconsistent earnings estimates (e.g., a maximum possible value smaller than a minimum possible value) and the order that the earnings questions are presented. For example, it may be the case that people expect a certain order to these types of questions and get confused if the order differs from their expectations.
mlm %>% filter(!(earnleast<=earnings & earnings<=earnmost)) %>%
select(earnorder) %>%
table()
## .
## lmt ltm mlt mtl tlm tml
## 13 7 8 14 3 2
Remember that m is for most, l is for least, and t is for typical earnings, so lmt means that a person was asked about the least they could earn first, then the most they could earn, and then their typical earnings.
There doesn’t seem to be an obvious pattern to which question order precedes the most errors. The three orders with the most inconsistent responses all ask about MOST before TYPICAL, while the two orders with the fewest ask about TYPICAL first. It is hard to say whether question order influenced errors in this survey, but in the future it may be best to present the TYPICAL earnings question first. It is also worth noting that the TYPICAL question is the only one that specifies “before subtracting any expenses”, which may have been confusing for some respondents.
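Raw counts can also be a little misleading if the six orders were not shown equally often, so a rate-based version of the same check might look like this (a sketch, not in the original script):
# Sketch: share of inconsistent earnings responses within each question order.
mlm %>%
  group_by(earnorder) %>%
  summarise(n = n(),
            share_inconsistent = mean(!(earnleast <= earnings & earnings <= earnmost),
                                      na.rm = TRUE))
The actual drop happens in the next chunk.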
mlm <- mlm %>%
filter(earnleast<=earnings & earnings<=earnmost) %>%
filter(earnings>=0, earnleast>=0, earnmost>=0) %>%
filter(!is.na(religious)) %>%
filter(nowMLM!=1|is.na(nowMLM))
Before the analysis, I also drop a few observations from the data. Three subjects were dropped due to missing values, and three were dropped because they were the only subjects currently involved in an MLM. Fifty additional subjects (8.3 percent) were dropped due to inconsistent earnings estimates (e.g., a maximum possible value smaller than a minimum possible value, or negative earnings). I also conducted an analysis (not included here) of whether those dropped in this step differed in any obvious way from the rest of the sample on the other measured variables. Those who were dropped answered slightly fewer prior-knowledge questions correctly (an average of 4.8 compared to 5.27 for those not dropped; p = 0.07) and were slightly more likely to be religious in general (58 percent compared to 44 percent; p = 0.06) and specifically very religious (24 percent compared to 12 percent; p = 0.06).
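Those balance checks aren’t reproduced here, but they would look something like the sketch below, run on a copy of the data saved just before the drop (mlm_prefilter is a hypothetical name for that copy, and the sketch only flags the earnings-inconsistency drops):
# Sketch of the kind of balance check described above. Assumes mlm_prefilter
# is a copy of the data saved just before the filtering step (hypothetical).
mlm_prefilter %>%
  mutate(dropped = !(earnleast <= earnings & earnings <= earnmost &
                       earnings >= 0 & earnleast >= 0 & earnmost >= 0)) %>%
  group_by(dropped) %>%
  summarise(mean_knowledge = mean(knowledge, na.rm = TRUE),
            share_religious = mean(relig, na.rm = TRUE),
            share_very_religious = mean(rel_vr, na.rm = TRUE))
With those observations dropped, the log transformations come next.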
mlm <- mlm %>%
mutate(linterest = log(interest),
learnings = log(earnings+1),
learnmost = log(earnmost+1),
learnleast = log(earnleast+1),
lover6 = log(over6+1),
lexpenses = log(expenses+1),
learnbta = log(earnbta))
I saved these transformations until after I had dropped the missing and inconsistent observations because log transforms don’t work with zero or negative values. I still had to use the ol’ log(x+1) transformation for the variables that did have zeroes. Many of these variables have extremely skewed distributions, and their log-transformed versions are much closer to normal and produce more linear relationships.
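A quick way to see the effect (a sketch, not in the original script) is to compare the raw and logged histograms for one of the skewed variables:
# Sketch: the raw earnings distribution versus its log(x + 1) transform.
ggplot(mlm, aes(earnings)) + geom_histogram(bins = 30)
ggplot(mlm, aes(learnings)) + geom_histogram(bins = 30)
Finally, the clean data get written out.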
write.csv(mlm,"data/mlm_2022_clean.csv",row.names = FALSE)
saveRDS(mlm,"data/mlm_2022_clean.rds")
rm(list=ls()[ls()!="mlm"])
I export the clean data both as a *.csv and as an *.rds file. The *.csv file is more portable (e.g., I have co-authors who use Stata), but the *.rds file preserves the order of the factor levels. I took care to organize the factor levels because it makes presentation easier later.
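Reading the data back in later shows the difference (a sketch): the *.rds round trip keeps the factor levels I set, while the *.csv version would need them re-set by hand.
# Sketch: the .rds file preserves factor levels on re-import.
mlm_check <- readRDS("data/mlm_2022_clean.rds")
levels(mlm_check$treatment)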
The last line just cleans up all the lingering objects from the environment, leaving only the mlm data frame.