A place to store and share the code I wrote to analyze the data generated by a recent Amazon Mechanical Turk survey experiment. The analysis from this experiment was recently accepted for publication as "Income Disclosure and Consumer Judgment in a Multi-level Marketing Experiment" in the Journal of Consumer Affairs.
Austin M. Miller 7/29/2022
This document walks through the R script that I wrote to transform the data generated by a recent Amazon Mechanical Turk survey experiment in preparation for analysis. The goal of the analysis was to assess the impact of voluntary income disclosures in MLM marketing materials on consumer interest and earnings expectations. All participants were introduced to an MLM opportunity using marketing materials from the website of an actual MLM firm. The control group did not receive any income disclosure information; treatment group 1 received the income disclosure document created by the MLM firm itself; and treatment group 2 received an augmented form of the firm’s income disclosure information that included a graph and showed how many of the firm’s participants actually earned zero dollars. The analysis from this experiment was recently accepted for publication as “Income Disclosure and Consumer Judgment in a Multi-level Marketing Experiment” in the Journal of Consumer Affairs.
library(tidyverse)
library(stringr)
library(modelr)
library(broom)
library(lmtest)
library(sandwich)
library(stargazer)
library(readxl)
I don’t know if this script actually uses most of these packages; many of them are just included in most of my R scripts by default.
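If I ever wanted to know which of these packages the script actually calls, something like renv::dependencies() could scan the file for package usage. This is just a sketch, not part of the original script, and it assumes renv is installed and that the script is saved under the hypothetical file name "clean_mlm_data.R":
# Optional check (not part of the cleaning script): scan the script file for
# the packages it actually uses. The file name here is hypothetical.
renv::dependencies("clean_mlm_data.R")
On to the actual script: the first chunk reads in the raw Qualtrics export and flags each response's experimental group.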
rawdat <- read_excel("data/raw/MLMDisclosureData.xlsx") %>%
filter(StartDate>="2020-04-01") %>%
filter(Progress==100) %>%
.[,colSums(is.na(.))<nrow(.)] %>%
mutate(T1 = if_else(!is.na(Q69), 1, 0)) %>%
mutate(T2 = if_else(!is.na(Q150), 1, 0)) %>%
mutate(C = if_else(!is.na(Q20), 1, 0))
The *.xlsx data file from Qualtrics contains all responses to the survey, including practice responses that researchers created before the survey was officially launched. These lines filter out the practice responses and any incomplete responses, and they drop a few variables that Qualtrics generated with NA in every row. The mutate() commands flag each response as belonging to Treatment 1, Treatment 2, or the Control group.
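As a quick sanity check on those flags (a sketch, not in the original script), every response should land in exactly one of the three groups:
# Sketch of a consistency check: each row should be flagged as exactly one of
# Control, Treatment 1, or Treatment 2.
rawdat %>% count(C, T1, T2)
stopifnot(all(rawdat$C + rawdat$T1 + rawdat$T2 == 1))
Next, each group gets its own data frame, starting with Treatment 1.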
mlmT1 <- rawdat %>% filter(T1==1) %>%
select(-Q20,-Q25,-Q21,-Q22,-Q23,-Q27,-Q138) %>%
select(-Q150,-Q151,-Q164,-Q165,-Q166,-Q167,-Q137) %>%
rename(interest = Q69,
interfr = Q70,
earnings = Q94,
earnmost = Q95,
earnleast = Q96,
over6 = Q97,
earnbta = Q98,
) %>%
mutate(treatment = "Company Disclosure",
tml = if_else(EarningsT1_DO_Q94 < EarningsT1_DO_Q95 &
EarningsT1_DO_Q94 < EarningsT1_DO_Q96 &
EarningsT1_DO_Q95 < EarningsT1_DO_Q96, 1, 0),
tlm = if_else(EarningsT1_DO_Q94 < EarningsT1_DO_Q95 &
EarningsT1_DO_Q94 < EarningsT1_DO_Q96 &
EarningsT1_DO_Q96 < EarningsT1_DO_Q95, 1, 0),
mtl = if_else(EarningsT1_DO_Q95 < EarningsT1_DO_Q94 &
EarningsT1_DO_Q95 < EarningsT1_DO_Q96 &
EarningsT1_DO_Q94 < EarningsT1_DO_Q96, 1, 0),
mlt = if_else(EarningsT1_DO_Q95 < EarningsT1_DO_Q94 &
EarningsT1_DO_Q95 < EarningsT1_DO_Q96 &
EarningsT1_DO_Q96 < EarningsT1_DO_Q94, 1, 0),
ltm = if_else(EarningsT1_DO_Q96 < EarningsT1_DO_Q94 &
EarningsT1_DO_Q96 < EarningsT1_DO_Q95 &
EarningsT1_DO_Q94 < EarningsT1_DO_Q95, 1, 0),
lmt = if_else(EarningsT1_DO_Q96 < EarningsT1_DO_Q94 &
EarningsT1_DO_Q96 < EarningsT1_DO_Q95 &
EarningsT1_DO_Q95 < EarningsT1_DO_Q94, 1, 0)
)
This section removes the questions that are not relevant to this treatment group (Treatment 1, in this case) and renames all the dependent variables. The mutate() section here labels the treatment group and creates a marker for the order in which the earnings questions were asked. All participants were asked how much money they think they would earn in a typical year, as well as the most and the least they think they could earn. The order of these three questions was randomized across all participants. In the marker names, m is for most, l is for least, and t is for typical earnings.
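Since these six order dummies should be mutually exclusive, a quick check like the following (a sketch, not in the original script) confirms that each Treatment 1 response matches exactly one ordering:
# Sketch: each response should have exactly one of the six order dummies set.
mlmT1 %>%
  mutate(n_orders = tml + tlm + mtl + mlt + ltm + lmt) %>%
  count(n_orders)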
The following section does the same thing as above for each of the other treatment groups (Treatment 2 and the Control group).
mlmT2 <- rawdat %>% filter(T2==1) %>%
select(-Q20,-Q25,-Q21,-Q22,-Q23,-Q27,-Q138) %>%
select(-Q69,-Q70,-Q94,-Q95,-Q96,-Q97,-Q98) %>%
rename(interest = Q150,
interfr = Q151,
earnings = Q164,
earnmost = Q165,
earnleast = Q166,
over6 = Q167,
earnbta = Q137
)%>%
mutate(treatment = "Graphical Disclosure",
tml = if_else(EarningsT2_DO_Q164 < EarningsT2_DO_Q165 &
EarningsT2_DO_Q164 < EarningsT2_DO_Q166 &
EarningsT2_DO_Q165 < EarningsT2_DO_Q166, 1, 0),
tlm = if_else(EarningsT2_DO_Q164 < EarningsT2_DO_Q165 &
EarningsT2_DO_Q164 < EarningsT2_DO_Q166 &
EarningsT2_DO_Q166 < EarningsT2_DO_Q165, 1, 0),
mtl = if_else(EarningsT2_DO_Q165 < EarningsT2_DO_Q164 &
EarningsT2_DO_Q165 < EarningsT2_DO_Q166 &
EarningsT2_DO_Q164 < EarningsT2_DO_Q166, 1, 0),
mlt = if_else(EarningsT2_DO_Q165 < EarningsT2_DO_Q164 &
EarningsT2_DO_Q165 < EarningsT2_DO_Q166 &
EarningsT2_DO_Q166 < EarningsT2_DO_Q164, 1, 0),
ltm = if_else(EarningsT2_DO_Q166 < EarningsT2_DO_Q164 &
EarningsT2_DO_Q166 < EarningsT2_DO_Q165 &
EarningsT2_DO_Q164 < EarningsT2_DO_Q165, 1, 0),
lmt = if_else(EarningsT2_DO_Q166 < EarningsT2_DO_Q164 &
EarningsT2_DO_Q166 < EarningsT2_DO_Q165 &
EarningsT2_DO_Q165 < EarningsT2_DO_Q164, 1, 0)
)
mlmTC <- rawdat %>% filter(C==1) %>%
select(-Q69,-Q70,-Q94,-Q95,-Q96,-Q97,-Q98) %>%
select(-Q150,-Q151,-Q164,-Q165,-Q166,-Q167,-Q137) %>%
rename(interest = Q20,
interfr = Q25,
earnings = Q21,
earnmost = Q22,
earnleast = Q23,
over6 = Q27,
earnbta = Q138
)%>%
mutate(treatment = "No Disclosure",
tml = if_else(EarningsC_DO_Q21 < EarningsC_DO_Q22 &
EarningsC_DO_Q21 < EarningsC_DO_Q23 &
EarningsC_DO_Q22 < EarningsC_DO_Q23, 1, 0),
tlm = if_else(EarningsC_DO_Q21 < EarningsC_DO_Q22 &
EarningsC_DO_Q21 < EarningsC_DO_Q23 &
EarningsC_DO_Q23 < EarningsC_DO_Q22, 1, 0),
mtl = if_else(EarningsC_DO_Q22 < EarningsC_DO_Q21 &
EarningsC_DO_Q22 < EarningsC_DO_Q23 &
EarningsC_DO_Q21 < EarningsC_DO_Q23, 1, 0),
mlt = if_else(EarningsC_DO_Q22 < EarningsC_DO_Q21 &
EarningsC_DO_Q22 < EarningsC_DO_Q23 &
EarningsC_DO_Q23 < EarningsC_DO_Q21, 1, 0),
ltm = if_else(EarningsC_DO_Q23 < EarningsC_DO_Q21 &
EarningsC_DO_Q23 < EarningsC_DO_Q22 &
EarningsC_DO_Q21 < EarningsC_DO_Q22, 1, 0),
lmt = if_else(EarningsC_DO_Q23 < EarningsC_DO_Q21 &
EarningsC_DO_Q23 < EarningsC_DO_Q22 &
EarningsC_DO_Q22 < EarningsC_DO_Q21, 1, 0),
)
I’ll break up the following transformations into multiple pipe %>%
chains so that I can describe each section separately.
mlm <- rbind(mlmT1,mlmT2,mlmTC) %>%
mutate(treatment = as.factor(treatment) %>% fct_relevel("No Disclosure")
) %>%
mutate(earnorder = if_else(tml==1,"tml",
if_else(tlm==1,"tlm",
if_else(mtl==1,"mtl",
if_else(mlt==1,"mlt",
if_else(lmt==1,"lmt",
if_else(ltm==1,"ltm", NA_character_))))))
)
This first part binds the three treatment groups back into a single data frame and begins the transformations. First, I convert the treatment variable to a factor and set the Control group as the reference group. This changes the automatic behavior of the regressions when treatment is included as an independent variable: indicator variables are created for each treatment group, with the Control group left as the omitted category.
I also create a composite earnorder variable to act as a label for the order in which the earnings questions were asked (most, least, and typical earnings).
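The nested if_else() calls above get the job done; an equivalent way to build earnorder would be case_when(), shown here only as an alternative sketch (not part of the original script):
# Sketch of an equivalent earnorder construction using case_when().
mlm %>%
  mutate(earnorder = case_when(
    tml == 1 ~ "tml",
    tlm == 1 ~ "tlm",
    mtl == 1 ~ "mtl",
    mlt == 1 ~ "mlt",
    lmt == 1 ~ "lmt",
    ltm == 1 ~ "ltm",
    TRUE     ~ NA_character_
  ))
The next chunk turns to the dependent variables themselves.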
mlm <- mlm %>%
mutate(interestg = interest,
interest = as.factor(interest) %>%
fct_recode("1" = "1- No Interest At All", "7" = "7- Extremely Interested") %>%
as.numeric(),
interestover1 = interest>1,
over6 = recode(over6, "0% (No Chance)"=0, "1-9% Chance"=0.05,
"10-19% Chance"=0.15, "20-29% Chance"=0.25, "30-39% Chance"=0.35,
"40-49% Chance"=0.45, "50-59% Chance"=0.55, "60-69% Chance"=0.65,
"70-79% Chance"=0.75, "80-89% Chance"=0.85, "90-99% Chance"=0.95,
"100% Chance (Certain)"=1
),
expenses = str_replace(Q28, "3,000-5,000", "4000") %>%
str_replace("100.00", "100") %>%
str_replace_all("[$.,?A-Za-z ]", "") %>%
as.numeric(),
earnhigh = if_else(earnings>1000,1,0),
earnfirst = !if_else(FL_17_DO_FL_21==1 | FL_17_DO_FL_25==1 | FL_17_DO_FL_22==1,
1, 0, missing = 0
)
)
This part transforms/creates several variables related to the dependent variables of the analysis:
- interest, over6, and expenses were originally recorded as strings, so they are converted to numeric variables here (a quick check follows this list).
- I keep a copy of interest named interestg that maintains the labels "1- No Interest At All" and "7- Extremely Interested", which I will use for creating the graphs of the distribution of interest.
- expenses was collected as a free-text response, and you can see a couple of ad-hoc replacements in the code. Next time I will restrict responses to be only numeric values, like I did for the earnings questions.
- interestover1 is an alternative way of looking at interest as a dependent variable. As we will see, most respondents in this experiment reported no interest in the presented MLM opportunity, so whether income disclosures have any effect on whether a person has any interest at all becomes an interesting question.
- earnhigh and earnfirst are indicator variables that are used in parts of the analysis.
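A quick tabulation (a sketch, not part of the original script) is an easy way to confirm that the string-to-numeric recode of interest kept the intended 1-7 ordering:
# Sketch: the labeled copy (interestg) and the numeric version (interest)
# should line up one-to-one.
mlm %>% count(interestg, interest)
The next chunk moves on to the demographic and background variables.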
mlm <- mlm %>%
mutate(gender = Q29,
woman = Q29=="Woman",
man = Q29=="Man",
age = Q30,
education = as.factor(Q33) %>%
fct_collapse("High school or less" = c("Some high school", "High school graduate"),
"College Graduate/Vocational Training" = c("Trade/technical/vocational training",
"College graduate",
"Some postgraduate education")
) %>%
fct_relevel("High school or less",
"Some college",
"College Graduate/Vocational Training",
"Post graduate degree"
),
educ_hs = education=="High school or less",
educ_sc = education=="Some college",
educ_cg = education=="College Graduate/Vocational Training",
educ_pg = education=="Post graduate degree",
income = as.factor(Q34) %>%
fct_relevel("Less than $24,999", "$25,000 to $49,999",
"$50,000 to 99,999", "$100,000 or more"
),
inc_0 = income=="Less than $24,999",
inc_25 = income=="$25,000 to $49,999",
inc_50 = income=="$50,000 to 99,999",
inc_100 = income=="$100,000 or more",
black = if_else(is.na(Q31_2), 0, 1),
white = if_else(is.na(Q31_4), 0, 1),
hispanic = Q32=="Hispanic or Latino",
other = if_else(is.na(Q31_5) & is.na(Q31_3) & is.na(Q31_1), 0, 1),
religious = Q35,
rel_nr = religious=="Not at all religious",
rel_lr = religious=="Not too religious",
rel_sr = religious=="Somewhat religious",
rel_vr = religious=="Very religious",
relig = !rel_nr,
knownMLM = Q43=="Yes",
knowMLMstill = Q136=="Yes",
wasMLM = Q44=="Yes",
nowMLM = Q127=="Yes",
MLMrecruited = Q46=="Yes")
Nothing too tricky—just a bunch of indicator variables and factors (with a bit of re-factoring here and there).
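One thing worth a quick look (again, just a sketch rather than part of the original script) is whether the collapsed and releveled factors came out with the intended level order, since that order is what later regressions and tables will use:
# Sketch: inspect the resulting level order of the collapsed factors.
levels(mlm$education)
levels(mlm$income)
The next chunk builds the knowledge scores and recodes earnbta.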
mlm <- mlm %>%
mutate(numeracy = if_else(Q36 == "More than 10 people", 1, 0, missing = 0) +
if_else(Q37 == "Exactly $15,000", 1, 0, missing = 0) +
if_else(Q38 == "More than $240", 1, 0, missing = 0),
finance = if_else(Q39 == "More than $102", 1, 0, missing = 0) +
if_else(Q40 == "Less than what you can buy today with the money in this account", 1, 0, missing = 0) +
if_else(Q41 == "False", 1, 0, missing = 0),
EVtest = if_else(Q42 == "More than $50", 1, 0, missing = 0),
knowledge = numeracy + finance + EVtest
) %>%
mutate(earnbta = ifelse(earnbta=="1- I Think I Would Earn Much Less Than Average Participant", 1, earnbta),
earnbta = ifelse(earnbta=="4- I Think I Would Earn As Much As Average Participant", 4, earnbta),
earnbta = ifelse(earnbta=="7- I Think I Would Earn Much More Than Average Participant", 7, earnbta) %>%
as.numeric(),
earnbta4 = earnbta>4
)
Things get a little more interesting here. numeracy, finance, and EVtest are scores generated by adding up the number of correct answers from a series of test questions. Then I iteratively mutated earnbta instead of (1) converting it to a factor, (2) re-factoring, and then (3) converting it to a numeric variable (like I did with interest in the Dependent Variables section). I don’t know which way I like better. The fact that I had to use ifelse() instead of if_else() in this one section (because if_else() requires both possible outcomes to be of the same type) probably means I should have gone the re-factoring route. What do you think?
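For what it’s worth, the re-factoring route would look something like the sketch below. It is not part of the original script and would replace the ifelse() chain rather than run after it, so it assumes earnbta still holds the original string responses:
# Sketch of the factor-based alternative to the ifelse() chain above.
# Assumes earnbta still contains the original string labels.
mlm %>%
  mutate(earnbta = as.factor(earnbta) %>%
           fct_recode("1" = "1- I Think I Would Earn Much Less Than Average Participant",
                      "4" = "4- I Think I Would Earn As Much As Average Participant",
                      "7" = "7- I Think I Would Earn Much More Than Average Participant") %>%
           as.character() %>%
           as.numeric(),
         earnbta4 = earnbta > 4)
The next chunk builds the risk score from eight separate questions.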
cols <- c("Q47_1","Q47_2","Q47_3","Q47_4","Q47_5","Q47_6","Q47_7","Q47_8")
mlm[cols] <- lapply(mlm[cols], as.factor)
mlm[cols] <- map(mlm[cols], ~fct_recode(.,"1" = "Extremely Unlikely",
"2" = "Moderately Unlikely",
"3" = "Slightly Unlikely",
"4" = "Not Sure",
"5"= "Slightly Likely",
"6" = "Moderately Likely",
"7" = "Extremely Likely"
)
)
mlm[cols] <- lapply(mlm[cols], as.numeric)
mlm <- mlm %>%
mutate(risk = Q47_1 + Q47_2 + Q47_3 + Q47_4 + Q47_5 + Q47_6 + Q47_7 + Q47_8)
Here, I created the risk variable in multiple steps. I had eight different variables that all had to be re-coded the same way, so to avoid repetition I just applied the same transformations to all eight columns at once. I could probably rewrite this to match the style of the rest of the document (which relies more heavily on pipes, %>%), but it works just fine.
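If I did rewrite it, across() would let the whole thing sit in a single mutate() call. The version below is only a sketch of how the block above could have been written (it is not part of the script, and it assumes the Q47 columns still hold their original string labels):
# Sketch: the same recode folded into a single pipe with across().
mlm %>%
  mutate(across(all_of(cols),
                ~ as.factor(.x) %>%
                  fct_recode("1" = "Extremely Unlikely", "2" = "Moderately Unlikely",
                             "3" = "Slightly Unlikely", "4" = "Not Sure",
                             "5" = "Slightly Likely", "6" = "Moderately Likely",
                             "7" = "Extremely Likely") %>%
                  as.numeric()),
         risk = Q47_1 + Q47_2 + Q47_3 + Q47_4 + Q47_5 + Q47_6 + Q47_7 + Q47_8)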
Before dropping any observations, I want to see if there are any patterns with respect to survey responses that have inconsistent earnings estimates (e.g., a maximum possible value smaller than a minimum possible value) and the order that the earnings questions are presented. For example, it may be the case that people expect a certain order to these types of questions and get confused if the order differs from their expectations.
mlm %>% filter(!(earnleast<=earnings & earnings<=earnmost)) %>%
select(earnorder) %>%
table()
## .
## lmt ltm mlt mtl tlm tml
## 13 7 8 14 3 2
Remember that m is for most, l is for least, and t is for typical earnings, so lmt means that a person was asked about the least they could earn first, then the most they could earn, and then their typical earnings.
There doesn’t seem to be an obvious pattern to which question order precedes the most errors. The three orders with the most inconsistent responses all ask about MOST before TYPICAL, while the two orders with the fewest ask about TYPICAL first. It is hard to say whether question order influenced errors in this survey, but in the future it may be best to present the TYPICAL earnings question first. It is also worth noting that the TYPICAL question is the only one that specifies “before subtracting any expenses”, which may have been confusing for some respondents.
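Raw counts can also be a little misleading if the six orders were not shown equally often, so a rate-based version of the same check might look like this (a sketch, not in the original script):
# Sketch: share of inconsistent earnings responses within each question order.
mlm %>%
  group_by(earnorder) %>%
  summarise(n = n(),
            share_inconsistent = mean(!(earnleast <= earnings & earnings <= earnmost),
                                      na.rm = TRUE))
The actual drop happens in the next chunk.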
mlm <- mlm %>%
filter(earnleast<=earnings & earnings<=earnmost) %>%
filter(earnings>=0, earnleast>=0, earnmost>=0) %>%
filter(!is.na(religious)) %>%
filter(nowMLM!=1|is.na(nowMLM))
Before the analysis, I also drop a few observations from the data. Three subjects were dropped due to missing values, and three were dropped because they were the only subjects currently involved in an MLM. Fifty additional subjects (8.3 percent) were dropped due to inconsistent earnings estimates (e.g., a maximum possible value smaller than a minimum possible value, or negative earnings). I also conducted an analysis (not included here) of whether those dropped in this step differed in any obvious way from the rest of the sample on the other measured variables. Those who were dropped answered slightly fewer prior-knowledge questions correctly (an average of 4.8 compared to 5.27 for those not dropped; p = 0.07) and were slightly more likely to be religious in general (58 percent compared to 44 percent; p = 0.06) and specifically very religious (24 percent compared to 12 percent; p = 0.06).
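Those balance checks aren’t reproduced here, but they would look something like the sketch below, run on a copy of the data saved just before the drop (mlm_prefilter is a hypothetical name for that copy, and the sketch only flags the earnings-inconsistency drops):
# Sketch of the kind of balance check described above. Assumes mlm_prefilter
# is a copy of the data saved just before the filtering step (hypothetical).
mlm_prefilter %>%
  mutate(dropped = !(earnleast <= earnings & earnings <= earnmost &
                       earnings >= 0 & earnleast >= 0 & earnmost >= 0)) %>%
  group_by(dropped) %>%
  summarise(mean_knowledge = mean(knowledge, na.rm = TRUE),
            share_religious = mean(relig, na.rm = TRUE),
            share_very_religious = mean(rel_vr, na.rm = TRUE))
With those observations dropped, the log transformations come next.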
mlm <- mlm %>%
mutate(linterest = log(interest),
learnings = log(earnings+1),
learnmost = log(earnmost+1),
learnleast = log(earnleast+1),
lover6 = log(over6+1),
lexpenses = log(expenses+1),
learnbta = log(earnbta))
I saved these transformations until after I had dropped the missing and inconsistent observations because log transforms don’t work with zero or negative values. I still had to use the ol’ log(x+1) transformation for the variables that did have zeroes. Many of these variables have extremely skewed distributions, and their log-transformed versions are much closer to normal and produce more linear relationships.
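A quick way to see the effect (a sketch, not in the original script) is to compare the raw and logged histograms for one of the skewed variables:
# Sketch: the raw earnings distribution versus its log(x + 1) transform.
ggplot(mlm, aes(earnings)) + geom_histogram(bins = 30)
ggplot(mlm, aes(learnings)) + geom_histogram(bins = 30)
Finally, the clean data get written out.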
write.csv(mlm,"data/mlm_2022_clean.csv",row.names = FALSE)
saveRDS(mlm,"data/mlm_2022_clean.rds")
rm(list=ls()[ls()!="mlm"])
I export the clean data both as a *.csv and as an *.rds file. The *.csv file is more portable (e.g., I have co-authors who use Stata), but the *.rds file preserves the order of the factor levels. I took care to organize the factor levels because it makes presentation easier later.
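Reading the data back in later shows the difference (a sketch): the *.rds round trip keeps the factor levels I set, while the *.csv version would need them re-set by hand.
# Sketch: the .rds file preserves factor levels on re-import.
mlm_check <- readRDS("data/mlm_2022_clean.rds")
levels(mlm_check$treatment)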
The last line just cleans up all the lingering objects from the environment, leaving only the mlm data frame.