Effective Techniques for Visualizing Skewed Data Distributions
Skewed data follow an asymmetric distribution in which most data points cluster on one side, with a long tail extending either to the right (right-skewed) or to the left (left-skewed). This long tail can dominate the visual representation, pushing the majority of data points to the edge of the plot and making it hard to identify patterns within the bulk of the data.
In this article, I provide a summary of effective methods to visualize skewed distributions.
The graphics presented here are generated using R, with most of the source code accessible unless otherwise noted. You can also explore a collection of inspiring data visualizations at DataBrewer.co, where R code and datasets are available.
Using Transparency, Open Shapes, Colors, and Marginal Distributions to Reveal Patterns
The scatterplot on the left illustrates the relationship between housing sales and prices. The sparse points at higher sales push most data to the left edge of the graph, resulting in significant clustering and obscuring the overall distribution.
The enhanced scatterplot on the right employs several straightforward techniques to illustrate data skewness and uncover the underlying structure:
- Reduced point size and open shapes (like circles without fill) help lessen data overlap.
- Transparency in data points reveals the degree of clustering.
- Colors serve as a tool to highlight patterns. In this case, differentiating cities by color emphasizes their substantial effect on sales and prices, a key source of data variability.
- Marginal visualizations of univariate distributions reveal sources of skewness. Here, the distributions of sales (x-axis) and prices (y-axis) are shown as barcode plots and density plots, indicating that sales is the more strongly skewed variable.
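To see why transparency exposes clustering, it helps to work through the arithmetic. The sketch below is plain Python (the figures in this article were made in R) and assumes standard "over" alpha compositing of identically colored points, where each extra overlapping point covers a fraction alpha of whatever still shows through. The function name stacked_opacity is illustrative only.

```python
# Illustrative sketch: combined opacity of overlapping semi-transparent points.
# Assumption: standard "over" compositing of identically colored points.

def stacked_opacity(alpha, n_points):
    """Opacity seen where n_points points, each with opacity alpha, overlap."""
    return 1.0 - (1.0 - alpha) ** n_points


# With alpha = 0.1, darkness grows with the number of overlapping points.
for n in (1, 5, 10, 50):
    print(n, round(stacked_opacity(0.1, n), 3))
```

Under these assumptions a lone point at alpha 0.1 stays faint, while ten overlapping points reach roughly 0.65 opacity, so the darkness of a region becomes a rough visual proxy for how many points it holds.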
Facet the Data into Separate Subplots for Each Source of Skewness
The preceding plot suggests that city is a critical factor affecting data skewness. To address this, we can create subplots for each city, each with its own axis scales (labels omitted for clarity).
The subplot for each city clarifies skewness and distinctly illustrates the relationship between price and sales. While we successfully identified city as an influential source of skewness and minimized it through individual subplots, this approach is not always feasible for other datasets. Additionally, persistent outliers within a single variable may not be resolved through subplots.
In the next sections, I showcase techniques for visualizing the skewed dataset directly, without splitting it into per-city subplots.
Utilizing a 2D Histogram to Visualize Skewed Data
A 2D histogram is an effective way to handle skewed data in a scatterplot. It creates n intervals (bins) along each of the x and y axes, dividing the plot into a grid of n × n cells. The count of data points in each cell is then mapped to a color scale; a larger number of bins yields finer resolution.
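The binning step can be sketched in a few lines of plain Python. This is a minimal illustration of the counting logic only, not the code behind the figure; the helper names (bin_index, histogram_2d) and the synthetic right-skewed data are assumptions made for the example.

```python
import random

# Illustrative sketch of 2D histogram binning (counting only, no plotting).

def bin_index(value, low, high, n_bins):
    """Map a value in [low, high] to a bin index in 0..n_bins-1."""
    if high == low:  # degenerate range: everything falls in the first bin
        return 0
    idx = int((value - low) / (high - low) * n_bins)
    return min(idx, n_bins - 1)  # the maximum value belongs to the last bin


def histogram_2d(xs, ys, n_bins):
    """Count points in an n_bins x n_bins grid; result is counts[row][col]."""
    x_low, x_high = min(xs), max(xs)
    y_low, y_high = min(ys), max(ys)
    counts = [[0] * n_bins for _ in range(n_bins)]
    for x, y in zip(xs, ys):
        col = bin_index(x, x_low, x_high, n_bins)
        row = bin_index(y, y_low, y_high, n_bins)
        counts[row][col] += 1
    return counts


# Synthetic right-skewed data (hypothetical, for illustration only).
random.seed(42)
xs = [random.lognormvariate(0.0, 1.0) for _ in range(1000)]
ys = [x + random.gauss(0.0, 0.5) for x in xs]
counts = histogram_2d(xs, ys, 10)
print(max(max(row) for row in counts))  # count in the most crowded cell
```

Each cell count would then be mapped to a color; raising n_bins gives a finer grid at the cost of fewer points per cell.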
Creating a 2D Density Plot
A 2D density plot is akin to the 2D histogram, but it estimates the probability density function (PDF) of the continuous data instead of counting points in discrete bins. It can be drawn as continuous contours or as a discrete grid.
The density plot on the left highlights the concentration of data points at the bottom left, while the sparse points with minimal density blend into the background.
To enhance the visibility of sparse data, I applied a cube root transformation to the density values before mapping them to the color scale. This transformation lifts small fractional density values relative to large ones, so that both the concentrated data center and the sparse regions stand out.
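A small numeric sketch shows the effect of the cube root on color positions. The density values below are made up for illustration, and rescaling the result to the range 0 to 1 is an assumption about how the values reach the color scale, not a description of the original code.

```python
# Illustrative sketch: cube-root transform of density values before coloring.

def cube_root_rescale(densities):
    """Cube-root non-negative densities, then rescale so the maximum is 1."""
    roots = [d ** (1.0 / 3.0) for d in densities]
    top = max(roots)
    return [r / top for r in roots] if top > 0 else roots


# Hypothetical density values: one dense core and several sparse regions.
densities = [0.0, 0.001, 0.008, 0.125, 1.0]
linear = [d / max(densities) for d in densities]
rooted = cube_root_rescale(densities)

for d, lin, root in zip(densities, linear, rooted):
    print(d, round(lin, 3), round(root, 3))
```

On a linear scale the value 0.001 sits at 0.1% of the color range and is indistinguishable from the background; after the cube root it sits at about 10%, while the densest value still maps to the top of the scale.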
Mathematical Transformations of Data Reflected in Color
The previous density plot exemplifies how mapping mathematically transformed data to colors can effectively reveal hidden structures. Below is a notable example.
The heatmaps illustrate the African population density per square kilometer, where the minimum density is zero (e.g., much of Egypt) and the non-zero values vary widely. The density is represented using the classic viridis color scale.
Without data transformation, the map appears nearly blacked out: a few scattered large values stretch the color scale, so the far more numerous small values all fall at its dark end.
Logarithmic transformation makes the data distribution more symmetric, which improves the visualization and reveals clearer patterns, such as the dense population along the Nile. Areas with zero population are greyed out: the logarithm of zero is undefined, so these values are treated as missing.
Pseudo-logarithmic transformation applies logarithmic scaling for larger numbers while transitioning to a linear scale near zero. This technique produces a detailed heatmap, highlighting populous regions and the vast inhospitable Sahara desert.
In comparison, a higher-degree root transformation (e.g., the 7th to 10th root) can yield similar visual effects, though it may black out certain areas because it inflates fractional numbers while leaving zeros unchanged.
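The three transformations behave differently at and near zero, which is what drives the visual differences described above. The sketch below compares them in plain Python. The pseudo-log here uses an inverse-hyperbolic-sine formulation, which is one common way to define it and is an assumption; the maps in this article may use a different variant. The sigma and base parameters and all function names are likewise illustrative.

```python
import math

# Illustrative comparison of log, pseudo-log, and root transforms near zero.

def log10_or_missing(x):
    """log10 for positive values; None (missing) where the log is undefined."""
    return math.log10(x) if x > 0 else None


def pseudo_log(x, sigma=1.0, base=10.0):
    """Assumed asinh-based pseudo-log: linear near zero, log-like for large x."""
    return math.asinh(x / (2.0 * sigma)) / math.log(base)


def nth_root(x, n=7):
    """n-th root of a non-negative value."""
    return x ** (1.0 / n)


for value in (0.0, 0.01, 1.0, 1000.0, 1_000_000.0):
    print(value, log10_or_missing(value),
          round(pseudo_log(value), 4), round(nth_root(value), 4))
```

With these definitions, zero is missing under the log, stays at zero under the pseudo-log, and stays at zero under the 7th root. The root, however, lifts 0.01 to about 0.52, leaving a wide gap between zero and every small positive value, which is consistent with zero-valued areas appearing blacked out.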
Note: Besides transforming data mathematically, you can maintain original data while strategically adjusting its color associations and emphasizing or downplaying outliers. For example, heatmaps of infectious disease incidences in the U.S. before and after vaccine introduction use a broad color spectrum for high incidences and a limited range for lower incidences, effectively highlighting vaccination impact.
Mathematical Transformations of Data Reflected on Axes
Returning to the housing sale-price example, the plots below demonstrate a logarithmic transformation of the x-axis. The transformation spreads clustered data points apart, revealing their structure. For contrast, I created two plots from the same data: a scatterplot with the number of neighboring points mapped to a color scale, and a 2D histogram with probability density contours.
Another example shows the increase in GDP and human life expectancy from the 1800s to 2015, using a logarithmic axis transformation to spread out a dataset that would otherwise be clustered. The x-axis is on a log10 scale and is annotated with the original data values.
Note that not all mathematical transformations alter the shape of a distribution. For instance, common methods such as standardization into z-scores or normalization to the range [0, 1] are linear rescalings: they shift and stretch the values but leave the shape of the distribution unchanged, so they do not help in visualizing skewed data.
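This can be checked numerically. The plain-Python sketch below computes a moment-based skewness for synthetic right-skewed data (an assumed log-normal sample, not the housing data) before and after each transformation; the skewness helper is written out here for illustration.

```python
import math
import random
import statistics

# Illustrative check: linear rescaling keeps skewness, a log transform does not.

def skewness(values):
    """Moment-based skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Synthetic right-skewed sample (hypothetical data for illustration).
random.seed(1)
raw = [random.lognormvariate(0.0, 1.0) for _ in range(2000)]

mean, sd = statistics.fmean(raw), statistics.pstdev(raw)
z_scores = [(v - mean) / sd for v in raw]

low, high = min(raw), max(raw)
min_max = [(v - low) / (high - low) for v in raw]

logged = [math.log(v) for v in raw]

for name, values in [("raw", raw), ("z-score", z_scores),
                     ("min-max", min_max), ("log", logged)]:
    print(name, round(skewness(values), 3))
```

The z-score and min-max versions report the same skewness as the raw data up to floating-point error, whereas the log-transformed values come out far less skewed.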
Zooming into Clustered Data Regions
A practical approach is to initially display the entire dataset, then zoom in on the most clustered areas for detailed examination.
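One possible way to choose the zoom window is to take percentile limits of each variable and keep only the points that fall inside them. The sketch below assumes that approach; the 5th and 95th percentile defaults and the function names are hypothetical choices for illustration, not the method used for the figure.

```python
import statistics

# Illustrative sketch: pick a zoom window from percentiles, then filter points.

def zoom_limits(values, lower_pct=5, upper_pct=95):
    """Return (lower, upper) percentile limits; percentiles must be in 1..99."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[lower_pct - 1], cuts[upper_pct - 1]


def zoom(points, x_limits, y_limits):
    """Keep only the (x, y) points that fall inside both limit ranges."""
    (x_low, x_high), (y_low, y_high) = x_limits, y_limits
    return [(x, y) for x, y in points
            if x_low <= x <= x_high and y_low <= y <= y_high]


# Hypothetical data: a dense cluster plus two far-away outliers.
points = [(x, 2 * x) for x in range(1, 101)] + [(5000, 9000), (8000, 12000)]
xs = [p[0] for p in points]
ys = [p[1] for p in points]
zoomed = zoom(points, zoom_limits(xs), zoom_limits(ys))
print(len(points), len(zoomed))
```

In a plotting library, the same limits could instead be passed as axis ranges so that the full dataset stays intact and only the view changes.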
Finally, I summarize these valuable strategies below: