```{r} library(tidyverse) tips <- read_csv("http://ggobi.org/book/data/tips.csv") ``` ] --- class: informative # What is tipping? - When you're dining at a full-service restaurant - Tip 20 percent of your full bill. - When you grab a cup of coffee - Round up or add a dollar if youâ€™re a regular or ordered a complicated drink. - When you have lunch at a food truck - Drop a few dollars into the tip jar, but a little less than you would at a dine-in spot. - When you use a gift card - Tip on the total value of the meal, not just what you paid out of pocket. .footnote[[The basic rules of tipping that everyone should know about](https://www.washingtonpost.com/news/going-out-guide/wp/2016/09/15/tipping-can-be-complicated-these-are-the-basic-rules-you-should-know-about/)] --- # Recommended procedure in the book - *Step 1*: Develop a model - Should the response be `tip` alone and use the total bill as a predictor? - Should you create a new variable `tip rate` and use this as the repsonse? - *Step 2*: Fit the full model with sex, smoker, day, time and size as predictors - *Step 3*: Refine model: Should some variables should be dropped? - *Step 4*: Check distribution of residuals - *Step 5*: Summarise the model, if X=something, what would be the expected tip --- # Step 1 Calculate tip % as tip/total bill $\times$ 100

.font_small[ ```{r echo = TRUE} tips <- tips %>% mutate(tip_pct = tip/totbill * 100) #<< ``` ] --- # Step 2 Fit Fit the full model with all variables

.font_small[ ```{r echo = TRUE} tips_lm <- tips %>% select(tip_pct, sex, smoker, day, time, size) %>% lm(tip_pct ~ ., data=.) #<< ``` ] --- # Step 2 Model summary .pull-left[ .font_small[ ```{r modela, echo=TRUE, results="hide"} library(broom) library(kableExtra) tidy(tips_lm) %>% #<< kable(digits=2) %>% kable_styling() ``` ```{r modelb, echo=TRUE, results="hide"} glance(tips_lm) %>% #<< select(r.squared, statistic, p.value) %>% kable(digits=3) ``` ] ] .pull-right[ ```{r ref.label="modela", echo=FALSE} ```

```{r ref.label="modelb", echo=FALSE} ``` ] --- class: poll middle center font_large `r emo::ji("thinking")` Which variable(s) would be considered important for predicting tip %? --- # Step 3: Refine model .pull-left[ .font_small[ ```{r model_smalla, echo=TRUE, results='hide'} tips_lm <- tips %>% select(tip_pct, size) %>% #<< lm(tip_pct ~ ., data=.) tidy(tips_lm) %>% #<< kable(digits=2) %>% kable_styling() ``` ```{r model_smallb, echo=TRUE, results='hide'} glance(tips_lm) %>% #<< select(r.squared, statistic, p.value) %>% kable(digits=3) ``` ] ] .pull-right[

```{r ref.label="model_smalla", echo=FALSE} ```

```{r ref.label="model_smallb", echo=FALSE} ``` ] --- # Model summary

$$\widehat{tip %} = 18.44 - 0.92 \times size$$ --

As the size of the dining party increases by one person the tip decreases by approximately 1%. --- # Model assessment

$R^2 = 0.02$. --

This dropped by half from the full model, even though no other variables contributed significantly to the model. It might be a good step to examine interaction terms. --- class: poll middle What does $R^2 = 0.02$ mean? --- class: middle $R^2 = 0.02$ means that size explains just 2% of the variance in tip %. This is a very weak model. --

And $R^2 = 0.04$ is also a very weak model. --- class: poll middle What do the $F$ statistic and $p$-value mean? --

What do the $t$ statistics and $p$-value associated with model coeficients mean? --- # Overall model significance Assume that we have a random sample from a population. Assume that the model for the population is $$ \widehat{tip %} = \beta_0 + \beta_1 sexM + ... + \beta_7 size $$ and we have observed $$ \widehat{tip %} = b_0 + b_1 sexM + ... + b_7 size $$ The $F$ statistic refers to $$ H_o: \beta_1 = ... = \beta_7 = 0 ~~ vs ~~ H_a: \text{at least one is not 0}$$ The $p$-value is the probability that we observe the given $F$ value or larger, computed assuming $H_o$ is true. --- # Term significance Assume that we have a random sample from a population. Assume that the model for the population is $$ \widehat{tip %} = \beta_0 + \beta_1 sexM + ... + \beta_7 size $$ and we have observed $$ \widehat{tip %} = b_0 + b_1 sexM + ... + b_7 size $$ The $t$ statistics in the coefficient summary refer to $$ H_o: \beta_k = 0 ~~ vs ~~ H_a: \beta_k \neq 0 $$ The $p$-value is the probability that we observe the given $t$ value or more extreme, computed assuming $H_o$ is true. --- # Model diagnostics (MD) Normally, the final model summary would be accompanied diagnostic plots - .monash-blue2[observed vs fitted values] to check strength and appropriateness of the fit - .monash-blue2[univariate plot, and normal probability plot, of residuals] to check for normality - in the simple final model like this, the .monash-blue2[observed vs predictor], with model overlaid would be advised to assess the model relative to the variability around the model - when the final model has more terms, using a .monash-blue2[partial dependence plot] to check the relative relationship between the response and predictors would be recommended. --- # Residual plots .pull-left[ .font_small[ ```{r res_hist, echo=TRUE, fig.show='hide'} tips_aug <- augment(tips_lm) ggplot(tips_aug, aes(x=.resid)) + #<< geom_histogram() + xlab("residuals") ``` ]] .pull-right[ ```{r res_hist2, ref.label="res_hist", echo=FALSE} ``` ] --- # Residual normal probability plots .pull-left[ .font_small[ ```{r res_qq, echo=TRUE, fig.show='hide'} ggplot(tips_aug, aes(sample=.resid)) + #<< stat_qq() + stat_qq_line() + xlab("residuals") + theme(aspect.ratio=1) ``` ]] .pull-right[ ```{r res_qq2, ref.label="res_qq", echo=FALSE} ``` ] --- # Fitted vs observed .pull-left[ .font_small[ ```{r obs_fitted, echo=TRUE, fig.show='hide'} ggplot(tips_aug, aes(x=.fitted, y=tip_pct)) + #<< geom_point() + geom_smooth(method="lm") + xlab("observed") + ylab("fitted") ``` ]] .pull-right[ ```{r obs_fitted2, ref.label="obs_fitted", echo=FALSE} ``` ] --- # Model in the data space .pull-left[ .font_small[ ```{r fitted_model, echo=TRUE, fig.show='hide'} ggplot(tips_aug, aes(x=size, y=tip_pct)) + #<< geom_point() + geom_smooth(method="lm") + ylab("tip %") ``` ]] .pull-right[ ```{r fitted_model2, ref.label="fitted_model", echo=FALSE} ``` ] --- background-image: \url(https://i.stack.imgur.com/EVbuD.gif) background-size: 15% background-position: 80% 90% class: informative ## The result of this work would leave us with

a model that could be used to impose a dining/tipping policy in restaurants (see [here](https://travel.stackexchange.com/questions/40543/can-i-refuse-to-pay-auto-gratuity-in-a-restaurant)) --

and should also leave us with an unease that this policy is based on weak support. --- class: transition middle animated slideInLeft ## Plots as we have just seen, associated with pursuit of an answer to a specific question may be best grouped into the category of "initial data analysis (IDA)" or "model diagnostics (MD)".

.orange[Stay tuned for more on this area later.] --- ```{r endslide, child="assets/endslide.Rmd"} ```