Good Visualizations For Data Science

We are a highly visual species. Visuals speak to us better than raw numbers and tables. Data visualisation is a concise way to convey ideas that almost anyone can read. Knowing a range of tools lets us present our ideas in the most effective way possible.

In this post we will look at some of the common plots and use tools such as matplotlib, seaborn and a few other visualization frameworks to draw them.

Line plots
The most common plot is probably the line plot. In matplotlib it is the default plot type and can be drawn with the command plt.plot(x_values, y_values). Below is an example.
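
As a minimal sketch of a line plot, the snippet below draws a sine curve. The data is made up for illustration, and it uses the axes method ax.plot, which is the object-oriented counterpart of plt.plot.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data for illustration: a sine curve sampled at 100 points.
x_values = np.linspace(0, 2 * np.pi, 100)
y_values = np.sin(x_values)

fig, ax = plt.subplots()
# ax.plot draws a line through the (x, y) pairs, like plt.plot does.
lines = ax.plot(x_values, y_values, label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Line plot")
ax.legend()
```

Call plt.show() to open the figure in a window, or fig.savefig with a file name of your choice to write it to disk.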

Histograms
A histogram is related to the line plot, but it answers a different question: instead of tracing one value against another, it groups the values into bins and shows how many fall into each bin, which reveals the distribution of the data. So when do we show a line plot and when a histogram? In my opinion, histograms are the better choice when the data has a lot of random swings, and line plots shine when the data changes smoothly across its range. Let’s take an example of a histogram.
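
Here is a small histogram sketch using matplotlib. The samples are randomly generated for illustration, and the choice of 30 bins is an arbitrary assumption that you would tune for your own data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data: 1000 noisy samples drawn from a normal distribution.
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)

fig, ax = plt.subplots()
# hist returns the count per bin and the bin edges (one more edge than bins).
counts, bin_edges, _ = ax.hist(samples, bins=30, edgecolor="black")
ax.set_xlabel("value")
ax.set_ylabel("count")
ax.set_title("Histogram")
```

Even though the individual samples jump around randomly, the binned counts make the overall bell shape easy to see.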

Box plot
A box plot is useful when you want to see how the data is spread within each group. It gives you an idea of how far apart the quartiles are in the overall distribution and where the outliers lie, if there are any. Box plots are also non-parametric: they make no assumptions about the underlying distribution of the data. Below is an example of a box plot.
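
The following sketch draws one box per group with matplotlib. The three groups and their names are invented for illustration; the third one is deliberately skewed so that outliers show up.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data: three groups with different spreads.
rng = np.random.default_rng(seed=1)
group_a = rng.normal(0.0, 1.0, 200)
group_b = rng.normal(1.0, 2.0, 200)
group_c = rng.exponential(1.0, 200)  # skewed, so expect outlier points

fig, ax = plt.subplots()
# Each box spans the first to third quartile, with a line at the median.
result = ax.boxplot([group_a, group_b, group_c])
# Boxes are drawn at positions 1, 2, 3 by default.
ax.set_xticks([1, 2, 3])
ax.set_xticklabels(["A", "B", "C"])
ax.set_ylabel("value")
ax.set_title("Box plot")
```

The returned dictionary holds the drawn artists (boxes, medians, whiskers, fliers), which is handy if you want to restyle them afterwards.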

Violin plots
Violin plots take histograms a step further while keeping the advantages of box plots. They show the probability density of the data at different values, usually as a smoothed kernel density estimate. And like box plots, violin plots can be used to show the distribution of different categories of data, so they make it easy to see the differences between two similar groups. Take a look at the example of a violin plot below.
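
A minimal violin plot sketch, using matplotlib's own violinplot rather than seaborn to keep dependencies small. The two groups are made up: one has a single peak and the other has two, a difference a box plot alone would tend to hide.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data: one single-peaked group and one two-peaked group.
rng = np.random.default_rng(seed=2)
unimodal = rng.normal(0.0, 1.0, 300)
bimodal = np.concatenate([rng.normal(-2.0, 0.5, 150),
                          rng.normal(2.0, 0.5, 150)])

fig, ax = plt.subplots()
# The width of each violin follows the estimated density at that value.
parts = ax.violinplot([unimodal, bimodal], showmedians=True)
ax.set_xticks([1, 2])
ax.set_xticklabels(["one peak", "two peaks"])
ax.set_ylabel("value")
ax.set_title("Violin plot")
```

If you prefer seaborn, it offers a violin plot function as well, typically driven by a DataFrame with one column for the category and one for the value.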

3D plots
Sometimes it is the combination of variables that matters, and you will need to show them in 3D space. Of course, this does not scale beyond three dimensions, because we human beings cannot visualize more dimensions at once. Below is the code to show components in 3D.
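
As an illustrative sketch, the snippet below scatters points with three components in a 3D axes. The points are random and the axis names are placeholders; in practice the components might come from your own features or from a dimensionality reduction step.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data: 200 points with three components each.
rng = np.random.default_rng(seed=3)
points = rng.normal(size=(200, 3))

fig = plt.figure()
# projection="3d" asks matplotlib for a three-dimensional axes.
ax = fig.add_subplot(projection="3d")
# Color each point by its third component to add a depth cue.
ax.scatter(points[:, 0], points[:, 1], points[:, 2],
           c=points[:, 2], cmap="viridis")
ax.set_xlabel("component 1")
ax.set_ylabel("component 2")
ax.set_zlabel("component 3")
ax.set_title("3D scatter plot")
```

In an interactive window you can drag the figure to rotate it, which often helps more than any single fixed viewpoint.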

If you want to visualize relationships between components in a vector space with more than three dimensions, take a look at this 3Blue1Brown video. The idea described in the video is known as parallel coordinates and is a widely used technique for showing multidimensional data. The code to implement your own parallel coordinates visualizations can be seen in this link.
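
For a quick start with parallel coordinates, pandas ships a helper in pandas.plotting. The sketch below is not the linked implementation; it uses made-up data with four hypothetical features (f1 to f4) and two groups to show the idea: every row becomes one line that crosses a vertical axis per feature.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Made-up data: two groups of 30 rows, four numeric features each.
rng = np.random.default_rng(seed=4)
n = 30
frame = pd.DataFrame({
    "f1": np.concatenate([rng.normal(0, 1, n), rng.normal(3, 1, n)]),
    "f2": np.concatenate([rng.normal(2, 1, n), rng.normal(0, 1, n)]),
    "f3": np.concatenate([rng.normal(1, 1, n), rng.normal(1, 1, n)]),
    "f4": np.concatenate([rng.normal(0, 1, n), rng.normal(4, 1, n)]),
    "group": ["a"] * n + ["b"] * n,
})

fig, ax = plt.subplots()
# One line per row, colored by the value in the class column.
parallel_coordinates(frame, class_column="group", ax=ax,
                     color=["tab:blue", "tab:orange"], alpha=0.5)
ax.set_title("Parallel coordinates")
```

Features where the two colors separate cleanly (f1 and f4 here) distinguish the groups; features where the lines overlap (f3) do not.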

Showing relationships using graphs
The plots discussed above work best when you have structured data. In some cases, though, you have a lot of unstructured data with interconnected relationships (for example, Twitter or Facebook user connections). Graphs are a good way to show such relationships between different types of data. Code to show the graphs and the relationships is shown here.
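
One way to draw such a graph in Python is the networkx library, which is an assumption on my part since the original code is not reproduced here. The user names and connections below are invented for illustration.

```python
import networkx as nx
import matplotlib.pyplot as plt

# Made-up connections between hypothetical users.
connections = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("bob", "carol"),
    ("carol", "dave"),
    ("dave", "erin"),
]

graph = nx.Graph()
graph.add_edges_from(connections)

# A force-directed layout; the seed keeps the picture reproducible.
positions = nx.spring_layout(graph, seed=42)
# Scale each node by its number of connections.
node_sizes = [300 * graph.degree(node) for node in graph.nodes]

fig, ax = plt.subplots()
nx.draw(graph, positions, ax=ax, with_labels=True,
        node_size=node_sizes, node_color="lightblue")
ax.set_title("User connections")
```

Sizing nodes by degree makes the best connected users stand out at a glance.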

Plotting data on maps
If you are crunching data that is linked to spatial locations, you will probably need to show its distribution on a map of your target region. A commonly used format for location-based information is GeoJSON. In Python, geopandas is a popular library for plotting geographical information. In the code below, the districts are color coded according to their states.
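
Here is a hedged sketch of coloring districts by state with geopandas. To stay self-contained it builds four square "districts" by hand; the district and state names are invented. With real data you would typically load a GeoJSON file through gpd.read_file instead.

```python
import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import box

# Made-up geometry: four unit squares standing in for districts.
# With real data, something like gpd.read_file(<your GeoJSON path>)
# would replace this hand-built frame.
districts = gpd.GeoDataFrame(
    {
        "district": ["D1", "D2", "D3", "D4"],
        "state": ["North", "North", "South", "South"],
    },
    geometry=[box(0, 1, 1, 2), box(1, 1, 2, 2),
              box(0, 0, 1, 1), box(1, 0, 2, 1)],
)

fig, ax = plt.subplots()
# column="state" gives all districts of one state the same color.
districts.plot(column="state", categorical=True, legend=True,
               edgecolor="black", ax=ax)
ax.set_axis_off()
ax.set_title("Districts colored by state")
```

The same plot call works on real boundaries as long as the frame has a column holding the state of each district.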


A list of awesome visualisations in D3.
List of visualisations.
Google Sankey diagrams.
geopandas visualisations.

Joydeep Bhattacharjee