Homework 8 - Statistical transforms and aggregation
Overview: Visualization is a great way to communicate information, but as with any communication system, too much information can cause an overload! In this assignment we’ll explore visualizing a dataset that might be too large to show with the basic techniques that we started with. We’ll compare an overloaded visualization to versions that use different forms of “compression”, such as binning, dimensionality reduction, or summary statistics. The goal will be to preserve as much useful information as possible while avoiding overloading our reader.
Requirements:
Choose a public dataset that is large in the number of variables (> ~7-8), the number of observations (> ~5000-100000), or both.
There are no strict restrictions on what dataset you choose, but you should aim to choose a dataset that is unique and insightful.
You will likely also want to choose a dataset that requires little preprocessing if possible.
Take a look at the data resources page for inspiration.
Create a visualization of this dataset that individually encodes all of the observations/variables. That is, a user should be able to find the precise value of any variable in any observation.
If your dataset is large enough, this will likely be practically impossible. That’s the point! Just make your best effort to fulfill this requirement.
This should be a static visualization (we’ll discuss using interaction/animation to help with this task later in the course).
The exact form of the visualization is up to you. Do your best to choose geometries and encodings that are relevant for your chosen data and that let you accomplish the stated goal.
Create a visualization of this dataset that uses some form of aggregation to compress the visual representation of the data and make it more legible.
For datasets with many observations this could be a histogram, KDE plot, etc. Consider techniques for encoding multiple variables within these types of visualizations.
For datasets with many variables this could mean applying some form of dimensionality reduction such as PCA or t-SNE. In this case, make sure that it is still clear to the reader what the variables in the dataset are.
Your goal here is to strike a good balance between reducing the visual clutter and overplotting in the visualization, while still allowing the user to get useful and accurate takeaways from the visualization.
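As a rough illustration of the two aggregation routes described above, here is a minimal sketch using only NumPy. The dataset is synthetic (random numbers standing in for whatever dataset you choose), and the bin count of 40 and the choice of two principal components are arbitrary assumptions, not recommendations. PCA is computed directly from a singular value decomposition; a library implementation would work just as well.

```python
import numpy as np

# Synthetic stand-in for a large dataset: 50,000 observations, 8 variables.
# (Assumption for illustration only; replace with your own data.)
rng = np.random.default_rng(0)
data = rng.normal(size=(50_000, 8))
data[:, 1] += 0.8 * data[:, 0]  # make two of the variables correlated

# Route 1, many observations: bin two variables into a 2D histogram.
# Each cell of `counts` holds the number of observations in that bin,
# so a heatmap of `counts` replaces 50,000 overlapping points.
counts, x_edges, y_edges = np.histogram2d(data[:, 0], data[:, 1], bins=40)

# Route 2, many variables: PCA via singular value decomposition.
centered = data - data.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Coordinates of every observation on the first two principal components.
projected = centered @ components[:2].T

# Fraction of total variance captured by each component.
explained = singular_values**2 / np.sum(singular_values**2)

# Rows of `loadings` show how strongly each original variable contributes
# to each component, which helps keep the variables visible to the reader.
loadings = components[:2]
```

The arrays `counts` and `projected` can then be handed to the plotting library of your choice. Reporting `explained` and `loadings` alongside a PCA plot is one way to keep the original variables clear to the reader.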
Finally, compute a set of summary statistics for the dataset and create a visualization that uses these to convey relevant information to the reader.
This could involve several different techniques such as box-and-whisker plots, or simpler bar plots, etc.
This should be a separate visualization from the previous two, but feel free to include elements from previous visualizations for context. E.g. you could show the statistics in the context of a KDE.
Which statistics you choose to visualize is up to you. This could be mean, median, (co-)variance, quantiles, etc. The goal is again to convey as much relevant and accurate information from the underlying dataset as possible.
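For example, the statistics behind a box-and-whisker plot can be computed by hand before plotting. The sketch below is one possible approach using NumPy; the function name `box_summary` is hypothetical, and the 1.5 × IQR whisker rule is a common convention assumed here, not a requirement of the assignment.

```python
import numpy as np

def box_summary(values):
    """Return the statistics typically drawn in a box-and-whisker plot."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1

    # Assumed convention: whiskers extend to the most extreme points
    # that lie within 1.5 * IQR of the box; the rest count as outliers.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = values[(values >= low_fence) & (values <= high_fence)]

    return {
        "mean": float(values.mean()),
        "median": float(median),
        "q1": float(q1),
        "q3": float(q3),
        "whisker_low": float(inside.min()),
        "whisker_high": float(inside.max()),
        "n_outliers": int(values.size - inside.size),
    }

# Small made-up sample with one extreme value.
summary = box_summary([1, 2, 3, 4, 5, 100])
```

In this made-up sample the single extreme value pulls the mean far above the median, which is the kind of takeaway a caption for a summary-statistics plot could point out.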
Make sure to include clear explanatory captions for each visualization. In this case, each caption should highlight what a reader could take away from it.
Cite and provide a link to the source of the dataset used.
You should follow the standard instructions for submitting this assignment on Canvas.