class: center, middle, inverse, title-slide

# Some data viz advice
### Daniel Anderson
### Thursday, April 8, 2021

---
layout: true

<script>
feather.replace()
</script>

<div class="slides-footer">
<span>
<a class = "footer-icon-link" href = "https://github.com/datalorax/psych-seminar21/raw/main/anderson-psych-seminar21.pdf">
<i class = "footer-icon" data-feather="download"></i>
</a>
<a class = "footer-icon-link" href = "https://datalorax.github.io/psych-seminar21/">
<i class = "footer-icon" data-feather="link"></i>
</a>
<a class = "footer-icon-link" href = "https://github.com/datalorax/psych-seminar21">
<i class = "footer-icon" data-feather="github"></i>
</a>
</span>
</div>

---
# whoami

.pull-left[
* Research Assistant Professor: Behavioral Research and Teaching
* Dad (two daughters: 8 and 6)
* Pronouns: he/him/his
* Primary areas of interest: R, computational research, achievement gaps, systemic inequities, and variance between educational institutions
]

.pull-right[
![](img/IMG_1306.jpeg)
]

---
# Resources (free)

.pull-left[
[Healy](http://socviz.co)

<div>
<img src = http://socviz.co/assets/dv-cover-pupress.jpg height = 400>
</div>
]

.pull-right[
[Wilke](https://serialmentor.com/dataviz/)

<div>
<img src = https://external-content.duckduckgo.com/iu/?u=http%3A%2F%2Febook3000.com%2Fupimg%2Fallimg%2F190330%2F0153540.jpg&f=1&nofb=1 height = 400>
</div>
]

---
# Other Resources

* My classes!

* Sequence
  + EDLD 651: Introductory Educational Data Science (EDS)
  + EDLD 652: Data Visualization for EDS
  + EDLD 653: Functional Programming for EDS
  + EDLD 654: Machine Learning for EDS
  + Capstone

---
# Where to start?

* I *really* recommend moving to R as quickly as possible

--

http://r4ds.had.co.nz

<div>
<img src = img/r4ds.png height = 350>
</div>

---
# ggplot2!

.pull-left[
https://r-graphics.org

![](https://r-graphics.org/cover.jpg)
]

.pull-right[
Third edition [in progress](https://ggplot2-book.org)!
![](https://images-na.ssl-images-amazon.com/images/I/31uoy-qmhEL._SX331_BO1,204,203,200_.jpg)
]

---
# Last note before we really start

* These slides were produced with R

* See the source code [here](https://github.com/datalorax/psych-seminar19)

* The focus of this particular talk is not on the code itself

---
class: middle center

# Different ways of encoding data

![](index_files/figure-html/data-encoding-1.png)<!-- -->

---
# Other elements to consider

* Text
  + How is the text displayed (e.g., font, face, location)?
  + What is the purpose of the text?

--

* Transparency
  + Are there overlapping pieces?
  + Can transparency help?

---
# Other elements to consider

* Type of data
  + Continuous/categorical
  + Which can be mapped to each aesthetic?
      - e.g., shape and line type can only be mapped to categorical data, whereas color and size can be mapped to either.

---
class: middle center

# Basic Scales

![](index_files/figure-html/scales-wilke-1.png)<!-- -->

---
# Talk with a neighbor

How would you encode these data into a display?

| Month | Day | Location     | Temperature |
|:-----:|:---:|:-------------|-------------|
|  Jan  |  1  | Chicago      | 25.6        |
|  Jan  |  1  | San Diego    | 55.2        |
|  Jan  |  1  | Houston      | 53.9        |
|  Jan  |  1  | Death Valley | 51.0        |
|  Jan  |  2  | Chicago      | 25.5        |
|  Jan  |  2  | San Diego    | 55.3        |
|  Jan  |  2  | Houston      | 53.8        |
|  Jan  |  2  | Death Valley | 51.2        |
|  Jan  |  3  | Chicago      | 25.3        |

---
class: middle center

# Putting it into practice

![](index_files/figure-html/temp-change-1.png)<!-- -->

---
class: middle center

# Alternative representation

![](index_files/figure-html/temp-change2-1.png)<!-- -->

---
# Comparison

* Both represent three scales
  + Two position scales (x/y axis)
  + One color scale (categorical for the first, continuous for the second)

---
# More scales are possible

![](index_files/figure-html/five-scales-1.png)<!-- -->

---
background-image:url(http://socviz.co/dataviz-pdfl_files/figure-html4/ch-01-multichannel-1.png)
background-size:contain

Additional scales can become lost without high structure in the data

---
class: inverse-blue middle

# Thinking more about color

---
# Three fundamental uses

1. Distinguish groups from each other
1. Represent data values
1. Highlight

---
# Discrete items

Often no intrinsic order

--

### Qualitative color scale

* Finite number of colors
  + Chosen to maximize distinctness, while also being *equivalent*
  + Equivalent
      - No color should stand out
      - No impression of order

---
background-image:url(https://serialmentor.com/dataviz/color_basics_files/figure-html/qualitative-scales-1.png)
background-size:contain

# Some examples

.footnote[See more about the Okabe Ito palette origins [here](http://jfly.iam.u-tokyo.ac.jp/color/)]

---
background-image:url(https://serialmentor.com/dataviz/color_basics_files/figure-html/sequential-scales-1.png)
background-size:contain

# Sequential scale examples

---
background-image:url(https://serialmentor.com/dataviz/color_basics_files/figure-html/diverging-scales-1.png)
background-size:contain

# Diverging palettes

---
# Earth palette

![](index_files/figure-html/or1-1.png)<!-- -->

---

![](index_files/figure-html/ca1-1.png)<!-- -->

---
# Common problems

### Too many colors

More than 5-ish generally becomes difficult to track

![](index_files/figure-html/too-many-colors-1.png)<!-- -->

---
# Use labels

Still too many...

![](index_files/figure-html/states-labeled-1.png)<!-- -->

---
# Better

Get a subset

![](index_files/figure-html/subset-states-1.png)<!-- -->

---
# Best (but could still be improved)

![](index_files/figure-html/best-label-1.png)<!-- -->

---
# Problem w/ default palette

### For {ggplot2}

![](index_files/figure-html/colorblind1-1.png)<!-- -->

---
# Alternative: viridis

![](index_files/figure-html/viridis1-1.png)<!-- -->

---
# Revised version

![](index_files/figure-html/unnamed-chunk-2-1.png)<!-- -->

---
# Last few notes on palettes

* Do some research, find what you like .bold[and] what tends to work well

* Check that your palette is colorblind-safe

* Look into http://colorbrewer2.org/

---
class: inverse-blue middle

# Visualizing Uncertainty

---
# The primary problem

* When we see a point on a plot, we interpret it as .bolder[THE] value.
![](index_files/figure-html/loc_point-1.png)<!-- -->

---
# Some secondary problems

* We're not great at understanding probabilities

* We regularly round probabilities to 100% or 0%

* As probabilities move to the tails, we're generally worse

---
# How do we typically communicate uncertainty?

* Error bars

![](index_files/figure-html/error_bars-1.png)<!-- -->

---
# Thinking about uncertainty

Uncertainty means exactly what it sounds like - we are not 100% sure.

* We are nearly always uncertain of future events (forecasting)
* We can also be uncertain about past events
  + I saw a parked car at 8 AM, but the next time I looked at 2 PM it was gone. What time did it leave?

--

### Quantifying uncertainty

* We quantify our uncertainty mathematically using probability
* Framing probabilities as frequencies is generally more intuitive

---
class: bottom
background-image:url("https://serialmentor.com/dataviz/visualizing_uncertainty_files/figure-html/probability-waffle-1.png")
background-size:contain

# Framing a single uncertainty

---
# Non-discrete probabilities

Blue party has a 1% advantage w/ a margin of error of 1.76 points

### Who will win?
--

![](index_files/figure-html/pdf-1.png)<!-- -->

.footnote[Code taken from [Wilke](https://clauswilke.com/dataviz/)]

---
# Discretized version

![](index_files/figure-html/descritized2-eval-1.png)<!-- -->

---
class: inverse-red middle

# Point estimates

A few alternatives to error bars

---
# Multiple error bars

![](index_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

---
# Density stripes

![](index_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

---
# Actual densities

![](index_files/figure-html/densities-1.png)<!-- -->

---
class: inverse-red middle

# HOPs

### Hypothetical Outcome Plots (and related plots)

---
# Standard regression plot

![](index_files/figure-html/loess1-1.png)<!-- -->

---
# Alternative

![](index_files/figure-html/bootstrap1-1.png)<!-- -->

---
# HOPs

HOPs animate the process, so you can't settle on one "truth"

![](img/loess_hop.gif)

---
# Another example

![](https://github.com/wilkelab/ungeviz/raw/master/man/figures/README-cacao-samples-anim-1.gif)

---
# Last few notes on uncertainty

* Do try to communicate uncertainty whenever possible

* I'd highly recommend watching [this](https://youtu.be/E1kSnWvqCw0) talk by Matthew Kay on the topic

---
class: inverse-blue middle

# Data ink ratio

---
# What is it?

--

> ### Above all else, show the data
> <br>
> \-Edward Tufte

--

* Data-Ink Ratio = Ink devoted to the data / total ink used to produce the figure

--

* Common goal: Maximize the data-ink ratio

---
# Example

![](img/six-boxplots.png)

--

* First thought might be... Cool!

---
background-image:url(https://theamericanreligion.files.wordpress.com/2012/10/lee-corso-sucks.jpeg?w=660)
background-size:cover

---
# Minimize cognitive load

* Empirically, Tufte's plot was .bolder[the most difficult] for viewers to interpret.

--

* Visual cues (labels, gridlines) reduce the data-ink ratio, but can also reduce cognitive load.

---
# An example

### Which do you prefer?
.pull-left[
![](index_files/figure-html/h3_bad-1.png)<!-- -->
]

.pull-right[
![](index_files/figure-html/h3_good-1.png)<!-- -->
]

---
# Advice from Wilke

> Whenever possible, **visualize your data with solid, colored shapes** rather than with lines that outline those shapes. Solid shapes are more easily perceived, are less likely to create visual artifacts or optical illusions, and do more immediately convey amounts than do outlines.

.gray[emphasis added]

---
# Another example

.pull-left[
![](index_files/figure-html/iris_lines-1.png)<!-- -->
]

.pull-right[
![](index_files/figure-html/iris_colored_lines-1.png)<!-- -->
]

---
class: center middle

![](index_files/figure-html/iris_filled-1.png)<!-- -->

---
# Labels in place of legends

The prior slide is a great example of when annotations can be used in place of a legend to

* reduce cognitive load
* increase clarity
* increase beauty
* maximize the figure size

---
class: inverse-red middle

# Practical advice so far

---
class: inverse-red middle

## Include uncertainty wherever possible

<br>

--

## Avoid line drawings

<br>

--

## Maximize the data-ink ratio within reason (but prioritize reducing cognitive load)

---
class: inverse-red middle

## Use color to your advantage (and think critically about the palettes you choose)

<br>

--

## Consider plot annotations over legends

---
class: inverse-blue middle

# Grouped data

---
# Distributions

How do we display more than one distribution at a time?
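
As a rough sketch of how several of the displays on the following slides could be drawn with {ggplot2}: this assumes the built-in `iris` data as a stand-in, the object names are illustrative, and it is not the code used to build these slides.

```r
library(ggplot2)

# Assumption: iris (built into R) stands in for the data behind the slides.
# One distribution of Sepal.Length per species.
base <- ggplot(iris, aes(x = Species, y = Sepal.Length))

p_box    <- base + geom_boxplot()
p_violin <- base + geom_violin()
# Jitter horizontally only, so the plotted y values stay exact
p_jitter <- base + geom_jitter(width = 0.2, height = 0)

# Overlapping densities: transparency keeps every group visible
p_dens <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_density(alpha = 0.5)

# Stacked histograms (stacking is the default position for geom_histogram)
p_hist <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(bins = 20)
```

Printing any of these objects (e.g., `p_violin`) draws the plot. Sina and ridgeline plots need extension packages, for example `ggforce::geom_sina()` and `ggridges::geom_density_ridges()`.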
---
# Boxplots

![](index_files/figure-html/boxplots-1.png)<!-- -->

---
# Violin plots

![](index_files/figure-html/violin-1.png)<!-- -->

---
# Jittered points

![](index_files/figure-html/jittered-1.png)<!-- -->

---
# Sina plots

![](index_files/figure-html/sina-1.png)<!-- -->

---
# Stacked histograms

![](index_files/figure-html/stacked-histo-1.png)<!-- -->

---
# Overlapping densities

![](index_files/figure-html/overlap-dens-1.png)<!-- -->

---
# Ridgeline densities

![](index_files/figure-html/ridgeline-1.png)<!-- -->

---
class: inverse-red middle

# Quick empirical examples

---
# Titanic data

```
## # A tibble: 1,313 x 5
##   name                                          class   age sex    survived
##   <chr>                                         <chr> <dbl> <chr>     <int>
## 1 Allen, Miss Elisabeth Walton                  1st   29    female        1
## 2 Allison, Miss Helen Loraine                   1st    2    female        0
## 3 Allison, Mr Hudson Joshua Creighton           1st   30    male          0
## 4 Allison, Mrs Hudson JC (Bessie Waldo Daniels) 1st   25    female        0
## 5 Allison, Master Hudson Trevor                 1st    0.92 male          1
## 6 Anderson, Mr Harry                            1st   47    male          1
## # … with 1,307 more rows
```

---
# Boxplots

![](index_files/figure-html/boxplots-empirical-1.png)<!-- -->

---
# Violin plots

![](index_files/figure-html/violin-empirical-1.png)<!-- -->

---
# Jittered point plots

![](index_files/figure-html/jittered-empirical-1.png)<!-- -->

---
# Sina plot

![](index_files/figure-html/sina-empirical-1.png)<!-- -->

---
# Stacked histogram

![](index_files/figure-html/stacked-histo-empirical-1.png)<!-- -->

--

.realbig[🤨]

---
# Dodged

![](index_files/figure-html/dodged-histo-empirical-1.png)<!-- -->

---
# Better

![](index_files/figure-html/wrapped-histo-empirical-1.png)<!-- -->

---
# Overlapping densities

![](index_files/figure-html/overlap-dens-empirical-1.png)<!-- -->

--

Note the default colors really don't work well in most of these

---
class: center middle

![](index_files/figure-html/overlap-dens-empirical2-1.png)<!-- -->

---
# Ridgeline densities

![](index_files/figure-html/ridgeline-dens-empirical-1.png)<!-- -->

---
class: inverse-blue middle

# Visualizing amounts

---
# Bar plots

![](index_files/figure-html/bars-1.png)<!-- -->

---
# Flipped bars

![](index_files/figure-html/flipped_bars-1.png)<!-- -->

---
# Dotplot

![](index_files/figure-html/dots-1.png)<!-- -->

---
# Heatmap

![](index_files/figure-html/heatmap-1.png)<!-- -->

---
class: inverse-red middle

# A short journey

### How much does college cost?

---
# Tuition data

```
## # A tibble: 6 x 13
##   State      `2004-05` `2005-06` `2006-07` `2007-08` `2008-09` `2009-10`
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
## 1 Alabama     5682.838  5840.550  5753.496  6008.169  6475.092  7188.954
## 2 Alaska      4328.281  4632.623  4918.501  5069.822  5075.482  5454.607
## 3 Arizona     5138.495  5415.516  5481.419  5681.638  6058.464  7263.204
## 4 Arkansas    5772.302  6082.379  6231.977  6414.900  6416.503  6627.092
## 5 California  5285.921  5527.881  5334.826  5672.472  5897.888  7258.771
## 6 Colorado    4703.777  5406.967  5596.348  6227.002  6284.137  6948.473
## # … with 6 more variables: 2010-11 <dbl>, 2011-12 <dbl>, 2012-13 <dbl>,
## #   2013-14 <dbl>, 2014-15 <dbl>, 2015-16 <dbl>
```

---
# By state: 2015-16

![](index_files/figure-html/state-tuition1-1.png)<!-- -->

--

.realbig[🤮🤮🤮]

---
# Two puke emoji version

.pls[
.realbig[🤮🤮]
]

.prb[
![](index_files/figure-html/state-tuition2-1.png)<!-- -->
]

---
# One puke emoji version

.pls[
.realbig[🤮]
]

.prb[
![](index_files/figure-html/state-tuition3-eval-1.png)<!-- -->
]

---
# Kinda smiley version

.pls[
.realbig[🙂]
]

.prb[
![](index_files/figure-html/state-tuition4-eval-1.png)<!-- -->
]

---
# Highlight Oregon

.pls[
.realbig[]
]

.prb[
![](index_files/figure-html/oregon-highlight-eval-1.png)<!-- -->
]

---
# Not always good to sort

![](index_files/figure-html/income_df-sorted-1.png)<!-- -->

---
# Much better

![](index_files/figure-html/income_df-1.png)<!-- -->

---
# Heatmap

![](index_files/figure-html/heatmap2-eval-1.png)<!-- -->

---
# Better heatmap

![](index_files/figure-html/heatmap3-eval-1.png)<!-- -->

---
# Even better?
![](index_files/figure-html/heatmap5-eval-1.png)<!-- -->

---
# Or maybe this one?

![](index_files/figure-html/heatmap4-eval-1.png)<!-- -->

---
background-image: url("img/heatmap.png")
class: inverse-blue
background-size:contain

---
# Quick aside

* Think about the data you have

* Given that these are state-level data, they have a geographic component

---
background-image: url("img/states-heatmap.png")
class: inverse bottom
background-size:contain

---
class: inverse-blue center middle

# Some things to avoid

---
# Line drawings

### As discussed earlier

.pull-left[
### 🚫

![](index_files/figure-html/dens-titanic-1.png)<!-- -->
]

.pull-right[
### Change the fill

![](index_files/figure-html/dens-titanic-blue-1.png)<!-- -->
]

---
class: middle

.pull-left[
![](index_files/figure-html/iris_lines2-1.png)<!-- -->
]

.pull-right[
![](index_files/figure-html/iris_filled2-1.png)<!-- -->
]

---
# Much worse

### Unnecessary 3D

.pull-left[
![](img/3d-pie-10-v2.png)
]

.pull-right[
![](img/3d-pie-20-v2.png)
]

---
# Much worse

### Unnecessary 3D

.pull-left[
![](img/3d-pie-40-v2.png)
]

.pull-right[
![](img/3d-pie-80-v2.png)
]

---
# Horrid example

### Used relatively regularly

![](img/3d_bar.png)

---
# Pie charts

### Especially w/ lots of categories

![](https://serialmentor.com/dataviz/visualizing_proportions_files/figure-html/marketshare-pies-1.png)

---
# Alternative representation

![](https://serialmentor.com/dataviz/visualizing_proportions_files/figure-html/marketshare-side-by-side-1.png)

---
# A case for pie charts

* The number of categories (`\(n\)`) is low
* Differences are relatively large
* Familiar for some audiences

![](index_files/figure-html/pie-1.png)<!-- -->

---
# The anatomy of a pie chart

Pie charts are just stacked bar charts with a radial coordinate system

![](index_files/figure-html/stacked_bars_nopie-1.png)<!-- -->

---
# Horizontal

![](index_files/figure-html/horiz_stacked-1.png)<!-- -->

---
# My preference

![](index_files/figure-html/dodged_bars-1.png)<!-- -->![](index_files/figure-html/dodged_bars-2.png)<!-- -->

---
# Dual axes

* Exception: if the second axis is a direct transformation of the first

![](img/dual_axes.png)

.footnote[See many examples [here](http://www.tylervigen.com/spurious-correlations)]

---
# Truncated axes

![](img/truncated_axes.png)

---
class: middle

![](img/truncated_axes2.png)

---
# Not always a bad thing

> It is tempting to lay down inflexible rules about what to do in terms of producing your graphs, and to dismiss people who don't follow them as producing junk charts or lying with statistics. But being honest with your data is a bigger problem than can be solved by rules of thumb about making graphs. In this case there is a moderate level of agreement that bar charts should generally include a zero baseline (or equivalent) given that bars encode their variables as lengths. But it would be a mistake to think that a dot plot was by the same token deliberately misleading, just because it kept itself to the range of the data instead.

---
class: middle

![](http://socviz.co/dataviz-pdfl_files/figure-html4/ch-01-law-enrollments-1.png)

---
class: middle

![](http://socviz.co/dataviz-pdfl_files/figure-html4/ch-01-law-enrollments-2.png)

---
# Scaling issues

![](img/area_size.png)

---
class: middle center

# Poor binning choices

![](img/poor_binning.png)

---
class: inverse-blue middle

# Conclusions

## Practical takeaways to make better visualizations

---

1. Avoid line drawings
2. Sort bar charts in ascending/descending order as long as the other axis does not have implicit meaning
3. Consider dropping legends and using annotations, when possible
4. Use color to your advantage, but be sensitive to color-blindness, and use the right kind of palette
5. Consider double-encoding data (shapes and color)
6. Make your labels bigger!
.gray[Didn't talk about this one much but it's super common and really important]

---
# Some things to avoid

* Essentially never
  + Use dual axes (produce separate plots instead)
  + Use 3D unnecessarily

* Be wary of
  + Truncated axes
  + Pie charts (particularly with lots of categories)

---
class: inverse-green middle

# Thanks!

## Questions?

<svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M23 3a10.9 10.9 0 0 1-3.14 1.53 4.48 4.48 0 0 0-7.86 3v1A10.66 10.66 0 0 1 3 4s-4 9 5 13a11.64 11.64 0 0 1-7 2c9 5 20 0 20-11.5a4.5 4.5 0 0 0-.08-.83A7.72 7.72 0 0 0 23 3z"></path></svg> [@datalorax_](https://twitter.com/datalorax_)

<svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M9 19c-5 1.5-5-2.5-7-3m14 6v-3.87a3.37 3.37 0 0 0-.94-2.61c3.14-.35 6.44-1.54 6.44-7A5.44 5.44 0 0 0 20 4.77 5.07 5.07 0 0 0 19.91 1S18.73.65 16 2.48a13.38 13.38 0 0 0-7 0C6.27.65 5.09 1 5.09 1A5.07 5.07 0 0 0 5 4.77a5.44 5.44 0 0 0-1.5 3.78c0 5.42 3.3 6.61 6.44 7A3.37 3.37 0 0 0 9 18.13V22"></path></svg> [@datalorax](https://github.com/datalorax)

<svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <circle cx="12" cy="12" r="10"></circle> <line x1="2" y1="12" x2="22" y2="12"></line> <path d="M12 2a15.3 15.3 0 0 1 4 10 15.3 15.3 0 0 1-4 10 15.3 15.3 0 0 1-4-10 15.3 15.3 0 0 1 4-10z"></path></svg> [website](www.datalorax.com)

<svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M4 4h16c1.1 0 2 .9 2 2v12c0 1.1-.9 2-2 2H4c-1.1 0-2-.9-2-2V6c0-1.1.9-2 2-2z"></path> <polyline points="22,6 12,13 2,6"></polyline></svg> [daniela@uoregon.edu](mailto:daniela@uoregon.edu)