A place to store and share the code I wrote to analyze the data generated by a recent Amazon Mechanical Turk survey experiment. The analysis from this experiment was recently accepted for publication as "Income Disclosure and Consumer Judgment in a Multi-level Marketing Experiment" in the Journal of Consumer Affairs.
Austin Miller 7/31/2022
This document presents the R script that I wrote to create tables of summary statistics for the data generated by a recent Amazon Mechanical Turk survey experiment. The goal of the analysis was to assess the impact of voluntary income disclosures in MLM marketing materials on consumer interest and earnings expectations. All participants were introduced to an MLM opportunity using marketing materials from the website of an actual MLM firm. The control group did not receive any income disclosure information; treatment group 1 received the income disclosure document created by the MLM firm itself; and treatment group 2 received an augmented form of the firm’s income disclosure information that included a graph and showed how many participants in the firm actually earned zero dollars. The analysis from this experiment was recently accepted for publication as “Income Disclosure and Consumer Judgment in a Multi-level Marketing Experiment” in the Journal of Consumer Affairs.
library(tidyverse)
library(stringr)
library(modelr)
library(broom)
library(lmtest)
library(sandwich)
library(stargazer)
library(Hmisc)
library(RVAideMemoire)
library(lawstat)
I don’t know if this script actually uses each of these packages; many of them are just included in most of my R scripts by default. Some less common packages may include:

- Hmisc contains the rcorr() function for correlation tables with significance tests.
- RVAideMemoire contains mood.medtest() for comparing medians.
- lawstat contains levene.test() for comparing variances.

mlm <- readRDS("data/mlm_2022_clean.rds")
This line reads in the data created by the MLMExperiment_data.Rmd
document.
dependents <- c("interest","earnings","earnleast","earnmost","over6","expenses")
key_ind <- c("numeracy","finance","EVtest","educ_hs","educ_sc","educ_cg","educ_pg","earnbta")
other <- c("age","man","woman","black","white","hispanic","other","inc_0","inc_25","inc_50",
"inc_100","rel_nr","rel_lr","rel_sr","rel_vr","risk","knownMLM",
"wasMLM","MLMrecruited")
sumtable <-
mlm %>%
select(all_of(dependents), all_of(key_ind), all_of(other)) %>%
summarise(
across(
.fns = list(Observations = ~sum(!is.na(.x)),
Mean = ~mean(.x, na.rm = TRUE),
Median = ~median(.x, na.rm = TRUE),
Sd = ~sd(.x, na.rm = TRUE),
Min = ~min(.x, na.rm = TRUE),
Max = ~max(.x, na.rm = TRUE)
),
.names = "{.col}-{.fn}"
)
) %>%
pivot_longer(everything(), names_to=c("variable","stat"), values_to="val", names_sep="-") %>%
pivot_wider(names_from = stat, values_from = val) %>%
mutate(Sd = if_else(Max == 1 & Min == 0 & variable != "over6", NA_real_, Sd))
write.csv(sumtable, "descriptive/sumstats.csv")
The first thing I do is create some groups of variables so I can work
with a group at a time rather than listing all variables every time. The
full mlm
data frame actually contains 190 variables, but most of those
are the raw variables from Qualtrics and are not used. I keep them
around on the principle of non-destructive editing, I suppose. If I
decide that I want to create a new variable or re-transform something,
etc. (which I do a lot over the course of a project), then I don’t have
to go back and look for stuff in the original raw data or change the
line of code that keeps only “important” variables.
The lower part is a single pipe %>%
chain that creates and formats a
summary table of all the variables that will be important to the
analysis. The steps are:
- I change the .names parameter in across() from the default "{.col}_{.fn}" because I have some variables that already have _ in their names and I want to use the separator (now -) during reshaping.
- summarise(across(...)) outputs one column with a single value for each combination of variable and summary operation. In order to reshape this into a more readable summary table:
  - I use pivot_longer() to transpose the summary output. Each row (rather than column) now represents a single summary operation performed on a single variable.
  - I use names_sep="-" to separate the .names column into two during the pivot. So now I have one column "variable" identifying the variable being summarized, one column "stat" identifying the summary operation, and a column of values.
  - I use pivot_wider() to move the summary operations (identified by stat) into columns.

The final output is a table with one row for each variable and one column for each summary operation. It would require little formatting to prepare this table for inclusion in a report or presentation (though it is much too big to present at one time in a live presentation). In the following sections, I display portions of this table separately.
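Before looking at the real table, here is a minimal illustration of this reshaping pattern on made-up data (not the mlm data; it assumes the tidyverse is loaded as above):

toy <- tibble(a = c(1, 2, NA), b_var = c(0, 1, 1))  # b_var has an underscore in its name, like some mlm variables
toy %>%
  summarise(across(everything(),
                   .fns = list(Mean = ~mean(.x, na.rm = TRUE), Max = ~max(.x, na.rm = TRUE)),
                   .names = "{.col}-{.fn}")) %>%  # columns: a-Mean, a-Max, b_var-Mean, b_var-Max
  pivot_longer(everything(), names_to = c("variable", "stat"), values_to = "val", names_sep = "-") %>%
  pivot_wider(names_from = stat, values_from = val)  # one row per variable, one column per statistic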
sumtable %>% filter(variable %in% dependents)
## # A tibble: 6 x 7
## variable Observations Mean Median Sd Min Max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 interest 545 1.89 1 1.52 1 7
## 2 earnings 545 6352. 1000 16971. 0 200000
## 3 earnleast 545 2146. 25 10060. 0 165000
## 4 earnmost 545 89673. 7000 481325. 0 10000000
## 5 over6 545 0.208 0.05 0.261 0 1
## 6 expenses 381 3128. 1800 5907. 0 78000
Generally, interest in the presented opportunity is low with an average
response of 1.89 out of 7 (mean(!mlm$interestover1)
would reveal that
63 percent of all respondents report interest of
"1- No Interest At All"
). The distribution of estimated earnings is
wide, ranging from $0 to an estimated maximum earnings of $10 million.
The mean of respondents’ estimated annual earnings is $6,352. For
reference, the expected value of actual earnings described in the income
disclosure statements is $1,600 in a year, with a median of $0. The
many much higher earnings estimates in the sample may indicate a lack
of data, a misunderstanding of the data, or overconfidence in the face
of data. The mean of respondents’ expected chance of earning more than
$6,000 in a year is about 21 percent, which also likely represents an
average overestimation of expected earnings (although most respondents
report this probability to be 5 percent or less as evidenced by the
median of 0.05).
Finally, we note that expenses
only has 381 observations. Respondents
were allowed to skip the question about expenses because no information
was given about expenses. The mean expense estimation is $3,128, which
is about half of the average expected earnings in this survey and more
than the gross earnings of most actual participants represented in the
income disclosure information.
These variables are included in the analysis as potentially modifying the effect of income disclosures in the MLM context.
Knowledge-based questions and level of education are used as a measure
of a person’s ability to interpret the information in an income
disclosure. numeracy
, finance
, and EVtest
are scores generated by
adding up the number of correct answers from a series of test questions.
A consumer’s self-perception is measured by earnbta
, where 4 is
interpreted as
"4- I Think I Would Earn As Much As Average Participant"
. Some people
who perceive themselves as better-than-average may be more likely to
perceive low probability, high earnings outcomes presented in an income
disclosure as achievable.
sumtable[c(-2)] %>% filter(variable %in% key_ind)
## # A tibble: 8 x 6
## variable Mean Median Sd Min Max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 numeracy 2.27 3 0.878 0 3
## 2 finance 2.61 3 0.722 0 3
## 3 EVtest 0.393 0 NA 0 1
## 4 educ_hs 0.123 0 NA 0 1
## 5 educ_sc 0.246 0 NA 0 1
## 6 educ_cg 0.530 1 NA 0 1
## 7 educ_pg 0.101 0 NA 0 1
## 8 earnbta 3.05 3 1.36 1 7
The average number of correct numeracy questions in the sample is 2.27
out of 3, and the average number of correct finance answers is 2.61 out
of 3. Most people are able to answer all three of the numeracy and
finance test questions correctly (as evidenced by a median of 3 for both
variables). However, only 39 percent of respondents are able to answer
the one expected value question (represented by EVtest
) correctly.
In the educ
education categories, hs
represents high-school
graduation, sc
is some college without a degree, cg
reflects a
college degree or the completion of vocational training, and pg
is
some level of post-graduate education. As this sample is from workers on
Amazon Mechanical Turk, it is not wholly representative of the U.S.
adult population. One example of this is that our sample seems to be
more highly educated. Compared to the United States Census Bureau
(2021), our
sample is more likely to have obtained a college degree or vocational
training (53 percent of our sample versus near 30 percent in the U.S.).
The mean of earnbta is 3.05 on a seven-point scale, with a median of 3.

These variables are included to measure whether they correlate with interest in the MLM opportunity and/or estimated earnings.
sumtable[c(-2)] %>% filter(variable %in% other[c(1:7)])
## # A tibble: 7 x 6
## variable Mean Median Sd Min Max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 age 41.8 40 10.9 22 76
## 2 man 0.503 1 NA 0 1
## 3 woman 0.494 0 NA 0 1
## 4 black 0.0679 0 NA 0 1
## 5 white 0.844 1 NA 0 1
## 6 hispanic 0.0385 0 NA 0 1
## 7 other 0.116 0 NA 0 1
By design, our sample is composed of working-age individuals and the
median age in our sample is 40. The sample is about evenly split between
men and women, and sum(man==0 & woman==0)
reveals that 2 respondents
identify as another gender. Nearly 7 percent of subjects identify as
Black/African American (vs. 13.4 percent in the U.S.), 3.9 percent as
Hispanic/Latino (vs. 18.5 percent in the U.S.), and 11.6 percent as
American Indian, Alaskan Native, Hawaiian, or Some Other Race (vs. 7.5
percent in the U.S).
sumtable[c(-2)] %>% filter(variable %in% other[c(8:16)])
## # A tibble: 9 x 6
## variable Mean Median Sd Min Max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 inc_0 0.171 0 NA 0 1
## 2 inc_25 0.336 0 NA 0 1
## 3 inc_50 0.367 0 NA 0 1
## 4 inc_100 0.127 0 NA 0 1
## 5 rel_nr 0.565 1 NA 0 1
## 6 rel_lr 0.136 0 NA 0 1
## 7 rel_sr 0.180 0 NA 0 1
## 8 rel_vr 0.119 0 NA 0 1
## 9 risk 28.2 28 7.79 13 52
Household income categories include inc_0
less than $25,000, inc_25
between $25,000 and $50,000, inc_50
between $50,000 and $100,000, and
inc_100
greater than $100,000. The median income in our sample is less
than $50,000 (17 percent and 33 percent in the smallest two categories),
which is less than the over $60,000 median income in the U.S.
Religiosity categories are rel_nr
“Not at all religious”, rel_lr
“Not too religious”, rel_sr
“Somewhat religious”, and rel_vr “Very
“Very
religious”. Fifty-six point five percent of our sample categorized
themselves as “Not at all religious”. This is likely a less religious
group than the U.S. as a whole; only 20 percent of respondents from the
2017 Baylor Religion Survey categorized themselves as “Not religious”
(see Baylor University,
2017).
Risk, here measured as the sum of eight seven-point-scale responses, is bounded between 8 and 56. A score of 32 would indicate that a person responded “Not Sure” to every risk-measuring question. The minimum risk value in our sample is 13; the maximum is 52. This distribution is fairly symmetric around 28.
sumtable[c(-2)] %>% filter(variable %in% other[c(17:19)])
## # A tibble: 3 x 6
## variable Mean Median Sd Min Max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 knownMLM 0.468 0 NA 0 1
## 2 wasMLM 0.105 0 NA 0 1
## 3 MLMrecruited 0.543 1 NA 0 1
Nearly 47 percent of the sample knows someone personally who has participated in MLM, and over 54 percent have received recruitment attempts to join an MLM organization. Only 10 percent have participated in MLM at some point in their lives, but this participation rate is slightly higher than the rate estimated in a national survey of U.S. adults (see DeLiema et al., 2021), which found that 7.7 percent had participated in MLM at some point in the past.
This section generates a table of correlations between independent
variables and both interest
and earnings
.
rcorrlist <- mlm %>%
select(all_of(dependents), all_of(key_ind), all_of(other)) %>%
as.matrix() %>%
rcorr(type="pearson")
correlations <- as_tibble(rcorrlist$r) %>%
mutate(variable = row.names(rcorrlist$r)) %>%
pivot_longer(c(-"variable")) %>%
filter(as.vector(rcorrlist$P<.1)) %>%
pivot_wider() %>%
filter(!variable %in% dependents) %>%
filter(!(is.na(interest) & is.na(earnings))) %>%
select(variable, interest, earnings)
write.csv(correlations, "descriptive/correlations.csv")
The code here is a bit tricky, but worth it, I think. Using the
rcorr()
function from the package Hmisc
, I generate a matrix of
correlation coefficients rcorrlist$r
for every pair of variables,
simultaneously with a matrix of p-values rcorrlist$P
from t-tests for
whether each coefficient is equal to zero. The rcorr()
function takes
a matrix array as input so I have to convert back and forth from matrix
to data frame (or tibble
) a bit.
The second part of the code removes all the correlation coefficients
that don’t have a significantly low p-value (here I use p<0.1), keeps
only the correlation coefficients we want to report (in this case, each
independent variable paired with interest
and with earnings
), and
formats the resulting correlation table slightly for readability.
The order of operations here is important. Because I am filtering one
matrix by another (the correlation matrix by the logical matrix of
whether each pair is associated with a low enough p-value), I find it
practical to first remove the shape of each so both are just a single
dimension of values. But in order to keep track of which value is
associated with which pair of variables, I first create a column called
variable
that maintains the row names generated by rcorr()
. Then,
when I pivot_longer()
, I have one column of the original row names,
one column of the original column names, and one column of values. I
then filter the long data frame by the logical vector created by
as.vector(rcorrlist$P<.1)
. Now when I pivot_wider()
back to the
original shape, the correlations that are not statistically significant
are replaced with missing values and all of the correlation coefficients
have maintained the appropriate column and row names. All that’s left
is to filter out the rows I don’t need (including all the rows of
missing values) and order the columns for presentation.
Filtering significant correlations is helpful because it greatly cuts down on the amount of information you have to look at and try to process. Also, we can easily re-run the code with a different cutoff for p-values if we decide that some other value is more appropriate. For example:
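The following sketch (not part of the original script) re-runs the same pipeline with a stricter cutoff of p < 0.05; only the filtering line changes:

correlations_05 <- as_tibble(rcorrlist$r) %>%
  mutate(variable = row.names(rcorrlist$r)) %>%
  pivot_longer(c(-"variable")) %>%
  filter(as.vector(rcorrlist$P < .05)) %>%  # stricter cutoff than the p < .1 used above
  pivot_wider() %>%
  filter(!variable %in% dependents) %>%
  filter(!(is.na(interest) & is.na(earnings))) %>%
  select(variable, interest, earnings)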
In any case, what we learn from this section is that prior knowledge
scores tend to be negatively correlated with both interest
and
earnings
, believing your potential earnings are better than average is
positively correlated with both interest
and earnings
, and knowing
someone in or being recruited by MLM is negatively correlated with both
interest
and earnings
. Additionally, people who identify as Black,
are somewhat religious, or have higher comfort with risk tend to be more
interested in the MLM opportunity whereas those with extremely low
incomes (less than $25,000 per year) or who are not religious tend to be
less interested. Those with incomes closer to the median or who are more
religious tend to think they can make more by participating in the MLM
opportunity.
This last part is another exercise in data wrangling and reshaping. Our
main regression analysis compares the means of interest
and earnings
across treatment groups, but we can observe potentially two different
competing effects of income disclosures that will not be fully measured
by a linear regression analysis. This section not only measures the
Mean
, Median
, and Sd
of each dependent variable within each
treatment group (3 x 6 x 3 = 54 total statistics), but I also test for
whether the means, medians, and variances differ significantly across
treatment groups.
distributions <-
mlm %>%
select(all_of(dependents), treatment) %>%
group_by(treatment) %>%
summarise(
across(
.fns = list(Observations = ~sum(!is.na(.x)),
Mean = ~mean(.x, na.rm = TRUE),
Median = ~median(.x, na.rm = TRUE),
Sd = ~sd(.x, na.rm = TRUE)
),
.names = "{.col}-{.fn}"
)
) %>%
pivot_longer(c(-"treatment"), names_to=c("variable","stat"), values_to="val", names_sep="-") %>%
pivot_wider(names_from=c(treatment), values_from=val)
This first pipe %>%
chain creates a data frame of all 54 summary
statistics plus the number of observations in each treatment group. The
process is nearly identical to creating the table of summary statistics
in Summary Statistics for All
Variables. In this case, I group
by treatment before calculating summary statistics. Then, the reshaping
keeps a single column of statistic names but moves the treatment groups
into separate columns. The result is a table in which each row is a
combination of variable and summary operation, with one column for each
treatment group.
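A minimal illustration of this grouped version of the reshape, again on made-up data rather than the mlm data (assumes the tidyverse is loaded):

toy <- tibble(grp = c("A", "A", "B", "B"), y = c(1, 2, 3, 5))
toy %>%
  group_by(grp) %>%
  summarise(across(everything(),
                   .fns = list(Mean = ~mean(.x), Sd = ~sd(.x)),
                   .names = "{.col}-{.fn}")) %>%
  pivot_longer(c(-"grp"), names_to = c("variable", "stat"), values_to = "val", names_sep = "-") %>%
  pivot_wider(names_from = grp, values_from = val)  # rows: variable/stat pairs; columns: groups A and B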
distributions %>%
filter(stat == "Observations")
## # A tibble: 6 x 5
## variable stat `No Disclosure` `Company Disclosure` Graphical Disclo~1
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 interest Observations 179 185 181
## 2 earnings Observations 179 185 181
## 3 earnleast Observations 179 185 181
## 4 earnmost Observations 179 185 181
## 5 over6 Observations 179 185 181
## 6 expenses Observations 133 126 122
## # ... with abbreviated variable name 1: `Graphical Disclosure`
Let’s filter first by Observations
to see how many observations are in
each treatment group. We see 179 observations in the Control group, 185
in Treatment 1, and 181 in Treatment 2. The exception to this is again
expenses
, which has some (545 - 381 = 164) missing values spread
across treatment groups. Some of the calculations below will have to
account for this.
distributions %>%
filter(stat != "Observations")
## # A tibble: 18 x 5
## variable stat `No Disclosure` `Company Disclosure` `Graphical Disclosure`
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 interest Mean 1.91 1.82 1.95
## 2 interest Median 1 1 1
## 3 interest Sd 1.49 1.46 1.60
## 4 earnings Mean 7539. 5332. 6220.
## 5 earnings Median 2500 1000 1000
## 6 earnings Sd 12759. 17725. 19642.
## 7 earnleast Mean 2225. 2096. 2118.
## 8 earnleast Median 50 100 0
## 9 earnleast Sd 7574. 13380. 8132.
## 10 earnmost Mean 35122. 162270. 69419.
## 11 earnmost Median 10000 6000 5000
## 12 earnmost Sd 108359. 790837. 201122.
## 13 over6 Mean 0.270 0.197 0.158
## 14 over6 Median 0.15 0.05 0.05
## 15 over6 Sd 0.292 0.241 0.236
## 16 expenses Mean 3165. 2714. 3516.
## 17 expenses Median 2000 1500 1200
## 18 expenses Sd 4476. 3870. 8480.
Here we see all 54 summary statistics in the table. There’s a lot going on here; what should we focus on? As the main point of this section is to observe whether these distributions vary across treatment groups, we could look through this table and see what values appear to vary across columns within the same row. The rest of this section conducts statistical tests of exactly that. We can then use p-values to see which values vary significantly across columns.
t.test()
The first summary statistics to test will be the means. I use
t.test(x~y)
where x
is one of the dependent variables and y
is a
logical vector identifying which treatment group each observation is in.
I do this three times for each variable to generate three sets of tests:

- T1fTC: Treatment 1 (Company Disclosure) versus the Control group (No Disclosure)
- T2fTC: Treatment 2 (Graphical Disclosure) versus the Control group (No Disclosure)
- T2fT1: Treatment 2 (Graphical Disclosure) versus Treatment 1 (Company Disclosure)
ttests <- list()
ttests$T1fTC <- mlm %>%
filter(treatment!="Graphical Disclosure") %>%
select(all_of(dependents)) %>%
map(~(t.test(.~mlm %>%
filter(treatment!="Graphical Disclosure") %>%
select(treatment) == "Company Disclosure"
)
)$p.value
)
ttests$T2fTC <- mlm %>%
filter(treatment!="Company Disclosure") %>%
select(all_of(dependents)) %>%
map(~(t.test(.~mlm %>%
filter(treatment!="Company Disclosure") %>%
select(treatment)=="Graphical Disclosure"
)
)$p.value
)
ttests$T2fT1 <- mlm %>%
filter(treatment!="No Disclosure") %>%
select(all_of(dependents)) %>%
map(~(t.test(.~mlm %>%
filter(treatment!="No Disclosure") %>%
select(treatment)=="Graphical Disclosure"
)
)$p.value
)
I start with an empty list and then add three lists of six p-values (one for each dependent variable). The resulting list of lists is then combined into a data frame.
To generate the p-values:

- I use map() to apply t.test() to each dependent variable.
- The first argument of t.test() is a formula of the form x~y, where x is the dependent variable and y is a logical vector indicating which of the two remaining treatment groups each observation is from.
- The output of t.test() is a list of objects; one of them is a value named p.value.
- I make the indexing of p.value from the output of t.test() part of the function that is mapped to each dependent variable. Condensed, the function parameter of map() would look like ~t.test(x~y)$p.value.
- The output of map() is then a list of p-values; each is named after the dependent variable x that was tested.

bind_rows(ttests, .id = "compare")
## # A tibble: 3 x 7
## compare interest earnings earnleast earnmost over6 expenses
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 T1fTC 0.566 0.173 0.910 0.0316 0.00996 0.386
## 2 T2fTC 0.782 0.450 0.897 0.0446 0.0000823 0.683
## 3 T2fT1 0.404 0.650 0.985 0.123 0.121 0.342
We’ll take a closer look at the implications of these p-values later,
but for now we can notice that only earnmost
and over6
have any
t-test p-values less than 0.10.
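To make the mapped call concrete, here is a condensed standalone version of one of these tests (interest, Company Disclosure versus No Disclosure); this is an illustrative sketch, not part of the original script, but it computes the same p-value as the corresponding element of ttests$T1fTC:

sub <- mlm %>% filter(treatment != "Graphical Disclosure")  # keep the Control and Treatment 1 observations
t.test(sub$interest ~ (sub$treatment == "Company Disclosure"))$p.value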
mood.medtest()
Next I will generate a similar set of p-values from tests for whether
medians differ across treatment groups. For this purpose, I use
mood.medtest()
to perform a series of Mood’s median tests.
medtests <- list()
medtests$T1fTC <- mlm %>%
filter(treatment!="Graphical Disclosure") %>%
select(all_of(dependents)) %>%
map(~(mood.medtest(.~mlm %>%
filter(treatment!="Graphical Disclosure") %>%
select(treatment) == "Company Disclosure"
)
)$p.value
)
medtests$T2fTC <- mlm %>%
filter(treatment!="Company Disclosure") %>%
select(all_of(dependents)) %>%
map(~(mood.medtest(.~mlm %>%
filter(treatment!="Company Disclosure") %>%
select(treatment) == "Graphical Disclosure"
)
)$p.value
)
medtests$T2fT1 <- mlm %>%
filter(treatment!="No Disclosure") %>%
select(all_of(dependents)) %>%
map(~(mood.medtest(.~mlm %>%
filter(treatment!="No Disclosure") %>%
select(treatment) == "Graphical Disclosure"
)
)$p.value
)
The structure of this process is identical to that followed for t-tests
above. I only had to replace all the t.test()
calls to
mood.medtest()
.
bind_rows(medtests, .id = "compare")
## # A tibble: 3 x 7
## compare interest earnings earnleast earnmost over6 expenses
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 T1fTC 0.570 0.000164 0.639 0.00242 0.0378 0.0883
## 2 T2fTC 0.605 0.000338 0.0000380 0.0563 0.00000919 0.406
## 3 T2fT1 1 0.330 0.00000000133 0.600 0.0376 0.899
A few more tiny p-values for the medians than for the means. Again, we will return to these p-values shortly.
levene.test()
The last set of tests is for whether standard deviations differ across
treatment groups. levene.test()
performs a modified robust
Brown-Forsythe Levene-type test based on the absolute deviations from
the median (see Brown and Forsythe,
1974). Technically, this is a test of
variances rather than standard deviations but the implications are the
same.
sdtests <- list()
for (x in dependents) {
z <- mlm %>%
filter(treatment!="Graphical Disclosure") %>%
filter(!is.na(get(x)))
sdtests$T1fTC <- append(sdtests$T1fTC
, levene.test(select(z, all_of(x))[[1]]
, select(z, treatment)[[1]]
)$p.value
)
}
for (x in dependents) {
z <- mlm %>%
filter(treatment!="Company Disclosure") %>%
filter(!is.na(get(x)))
sdtests$T2fTC <- append(sdtests$T2fTC
, levene.test(select(z, all_of(x))[[1]]
, select(z, treatment)[[1]]
)$p.value
)
}
for (x in dependents) {
z <- mlm %>%
filter(treatment!="No Disclosure") %>%
filter(!is.na(get(x)))
sdtests$T2fT1 <- append(sdtests$T2fT1
, levene.test(select(z, all_of(x))[[1]]
, select(z, treatment)[[1]]
)$p.value
)
}
Just for fun, I decided to create this list in a slightly different way. Instead of using map() to apply the test to multiple variables, I create each list of p-values using a for loop. Within each loop:

- I create a subset z of the full mlm data frame that filters out the observations from the treatment group not being compared, and filters out missing values (this last one only applies to expenses).
- Within append(), I call levene.test() with two parameters. The first parameter is the dependent variable x selected from z. The second parameter is a factor of treatment groups selected from z. Because select() returns a one-column data frame rather than a vector, I extract each with [[1]].
- The output of levene.test() is a list of objects; one of them is a value named p.value. I index the p-value from the output.
- The p-value from levene.test() is then appended to the growing list of p-values for the current set of treatment groups being compared.

names(sdtests$T1fTC) <- dependents
names(sdtests$T2fTC) <- dependents
names(sdtests$T2fT1) <- dependents
bind_rows(sdtests, .id = "compare")
## # A tibble: 3 x 7
## compare interest earnings earnleast earnmost over6 expenses
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 T1fTC 0.566 0.214 0.895 0.0317 0.00956 0.702
## 2 T2fTC 0.782 0.623 0.900 0.0315 0.0000965 0.420
## 3 T2fT1 0.403 0.560 0.967 0.130 0.197 0.302
Unlike the map() method above, this for-loop method doesn’t automatically name each p-value according to the dependent variable that was tested. I use names() to apply the dependent variable names to the p-values in each list before combining the lists into a data frame.
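To see what a single pass of these loops computes, here is an unrolled version for interest in the Treatment 1 versus Control comparison (an illustrative sketch, equivalent to the loop body with x = "interest", not part of the original script):

z <- mlm %>%
  filter(treatment != "Graphical Disclosure") %>%  # keep the Control and Treatment 1 observations
  filter(!is.na(interest))
levene.test(z$interest, z$treatment)$p.value  # the value stored in sdtests$T1fTC for interest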
options(width = 100)
distributions <- inner_join(distributions
, bind_rows(list("Mean" = bind_rows(ttests, .id = "compare"),
"Median" = bind_rows(medtests, .id = "compare"),
"Sd" = bind_rows(sdtests, .id = "compare")
)
, .id = "stat") %>%
pivot_longer(all_of(dependents), names_to = "variable") %>%
pivot_wider(names_from = compare, values_from = value)
)
## Joining, by = c("variable", "stat")
write.csv(distributions, "descriptive/distributions.csv")
I change the width option so the entire table will display here. Then I join the p-values from all of the tests to the original distributions data frame. This, of course, requires a bit of reshaping nonsense.
Nested within inner_join()
, I create a list of the three data frames
of p-values. I name each data frame in this list according to the
summary operation that is tested, using names that match the values of
stat
in distributions
. The result is a data frame of all 54 p-values
with one column stat
identifying the summary operation, one column
compare
identifying the pairing of treatment groups being tested, and
one column of values for each dependent variable.
Before I can join this to distributions
, I need the structure to
match. I use pivot_longer()
to move the dependent variables from
columns into rows so that I have a single row of p-values each
identified by the combination of stat
, compare
, and variable
. Then
I use pivot_wider()
to move the compare
categories into columns.
Now, like in distributions
, each row is identified by the combination
of stat
and variable
and the inner_join()
will use the values of
these two variables to add three columns of p-values to the existing
distributions
data frame.
distributions
## # A tibble: 18 x 8
## variable stat `No Disclosure` `Company Disclosure` Graphical Disclos~1 T1fTC T2fTC T2fT1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 interest Mean 1.91 1.82 1.95 5.66e-1 7.82e-1 4.04e-1
## 2 interest Median 1 1 1 5.70e-1 6.05e-1 1 e+0
## 3 interest Sd 1.49 1.46 1.60 5.66e-1 7.82e-1 4.03e-1
## 4 earnings Mean 7539. 5332. 6220. 1.73e-1 4.50e-1 6.50e-1
## 5 earnings Median 2500 1000 1000 1.64e-4 3.38e-4 3.30e-1
## 6 earnings Sd 12759. 17725. 19642. 2.14e-1 6.23e-1 5.60e-1
## 7 earnleast Mean 2225. 2096. 2118. 9.10e-1 8.97e-1 9.85e-1
## 8 earnleast Median 50 100 0 6.39e-1 3.80e-5 1.33e-9
## 9 earnleast Sd 7574. 13380. 8132. 8.95e-1 9.00e-1 9.67e-1
## 10 earnmost Mean 35122. 162270. 69419. 3.16e-2 4.46e-2 1.23e-1
## 11 earnmost Median 10000 6000 5000 2.42e-3 5.63e-2 6.00e-1
## 12 earnmost Sd 108359. 790837. 201122. 3.17e-2 3.15e-2 1.30e-1
## 13 over6 Mean 0.270 0.197 0.158 9.96e-3 8.23e-5 1.21e-1
## 14 over6 Median 0.15 0.05 0.05 3.78e-2 9.19e-6 3.76e-2
## 15 over6 Sd 0.292 0.241 0.236 9.56e-3 9.65e-5 1.97e-1
## 16 expenses Mean 3165. 2714. 3516. 3.86e-1 6.83e-1 3.42e-1
## 17 expenses Median 2000 1500 1200 8.83e-2 4.06e-1 8.99e-1
## 18 expenses Sd 4476. 3870. 8480. 7.02e-1 4.20e-1 3.02e-1
## # ... with abbreviated variable name 1: `Graphical Disclosure`
We can now observe the final resulting summary table with p-values for
all of the relevant tests. First, we see that interest
and expenses
are seemingly unaffected by income disclosures as we find no significant
differences in means, medians, or variances. The changes in
distributions of various earnings estimates, however, are much more
interesting.
Notably, the mean of earnings
does not differ significantly across
control/treatment groups, but the median of earnings
is significantly
lower in each treatment group than in the “No Disclosure” group. This
means that most respondents have lower estimated earnings if they
received either version of income disclosure versus no disclosure. The
similarity in means reflects the fact that some respondents have
extremely high estimated earnings after receiving income disclosure,
sufficient to offset the decrease in the estimates of the majority.
Those with higher estimates of earnings may be subject to an anchoring
effect of having seen some higher earnings numbers in the disclosures.
Evidence of these two competing effects is also manifested to varying
degrees in the distributions of the other earnings estimation variables.
Most notably, when asked the most they think they could earn, we expect
any anchoring effect to be much stronger than when asked to estimate
Typical Earnings, as the most a person could earn is presumably more
dependent on the maximum earnings observed in an income disclosure.
Accordingly, the distributions of earnmost
follow a more pronounced
pattern than earnings
across control/treatment groups. The median of
earnmost
is significantly lower in either treatment group than in the
control group, but the mean is significantly higher in both treatment
groups. The number of values that are much higher than the median in
these groups is also reflected by a statistically larger variance in
each treatment than in the control group.
The distribution of earnleast
is seemingly different only in the
graphical-disclosure group. The median of earnleast
is only zero in
this second treatment group and is statistically lower than the median
in either the control group or the company-disclosure group. The
zero-dollar column in the graphical disclosure may serve as a cognitive
anchor when subjects are asked to estimate the least they might earn,
bringing both the mean and median down.
The mean, median, and variance of over6
are lower in both treatment
groups than in the control group. As the value of this variable has a
maximum at 100 percent, we are unlikely to observe the same effect of a
few relatively high values pulling up the mean as we see in the
distributions of earnings
or earnmost
.