Data Visualization in Data Science

Introduction

A huge amount of data is generated every day: over 2.5 quintillion bytes, according to a survey (1). One of the best ways to make sense of all this data is to visualize it in order to organize the information.

In the world of Data Science, multiple solutions exist. This resource presents two of the most popular ways to visualize data.

Description of data visualization

Through data visualization, information can be graphically represented and summarized with charts, plots, animations, infographics and more.

Data visualization makes it easy to share information, patterns and relationships. Multiple tools exist to fulfill this goal.

First, self-contained tools like Microsoft Power BI or Tableau do not require any code to manipulate and represent the data. Second, code-based tools, mostly using the Python or R programming languages, are widely used to visualize data. Among these languages, Python is the most widely represented and arguably the ‘best’ language to work with. The most popular libraries for data visualization are:

  • matplotlib
  • seaborn
  • plotly
  • bokeh
  • ggplot

Most of these libraries share a common set of features, and each also offers some more specialized ones of its own.

This resource will focus mainly on two libraries: matplotlib and bokeh.

Matplotlib

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python (2).

It works seamlessly with other libraries, like Pandas (used for data manipulation and analysis) or NumPy (array objects and a collection of high-level mathematical functions for numerical computing), so that transformed data can be plotted directly.
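
As a minimal sketch of this interoperability, the example below builds a small Pandas DataFrame from a NumPy array and passes its columns straight to matplotlib. The column names, the synthetic data and the output file name are illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data: 50 evenly spaced points and their sine (illustrative only)
df = pd.DataFrame({"x": np.linspace(0, 2 * np.pi, 50)})
df["sin_x"] = np.sin(df["x"])

fig, ax = plt.subplots()
# Pandas Series can be passed to matplotlib like any other array
ax.plot(df["x"], df["sin_x"], label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.legend()
fig.savefig("sine_from_dataframe.png")
```

In an interactive session or a Jupyter Notebook, the backend line can be dropped and plt.show() used instead of saving to a file.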

This versatility is also reflected in the wide choice of plots available, from basic plots (like scatter plots, bar plots, stem and step plots, stackplots, etc.) to array plots, statistical plots and even 3D plots.

Figure 1. A scatter plot representation in matplotlib with the Python code to generate it.
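
A scatter plot of this kind can be sketched in a few lines. The code below is not the code from the figure: the data is synthetic, and the variable names and output file name are assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: y depends linearly on x, plus some random noise
rng = np.random.default_rng(seed=0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.5, size=100)

fig, ax = plt.subplots(figsize=(6, 4))
# Color each point by its y value to show how an extra dimension can be encoded
points = ax.scatter(x, y, c=y, cmap="viridis", alpha=0.8)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Scatter plot of synthetic data")
fig.colorbar(points, ax=ax, label="y value")
fig.savefig("scatter.png", dpi=150)
```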

Some other data visualization libraries, like seaborn (3), are based on matplotlib and provide a high-level interface for drawing attractive and informative statistical graphics. By building on the matplotlib core, seaborn can focus on more developed features that matplotlib usually lacks.

Figure 2. A part of the example gallery of plots available in seaborn.
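
To give a rough idea of what this high-level interface looks like, here is a minimal sketch that draws a box plot per group from a DataFrame in a single call. The data is synthetic and the column names ("group", "value") are assumptions made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: three groups of 50 normally distributed values each
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 50),
    "value": rng.normal(size=150),
})

# One call computes the statistics and draws one box per group;
# seaborn returns the underlying matplotlib Axes
ax = sns.boxplot(data=df, x="group", y="value")
ax.set_title("Distribution of values per group")
ax.figure.savefig("boxplot.png")
```

Because the returned object is a regular matplotlib Axes, the usual matplotlib methods (titles, labels, saving) remain available for further customization.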

Bokeh

Like matplotlib, bokeh is a Python library to visualize data (4). It integrates with an entire ecosystem of analytics tools, like NumPy, Pandas, SciPy, scikit-learn and more.

One of the unique and powerful capabilities of bokeh is the ability to produce interactive and shareable plots and dashboards that can be published online or in Jupyter Notebooks. You can also add custom JavaScript to support advanced or specialized cases.

Figure 3. Microscopium, an interactive dashboard maintained by researchers at Monash University.
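
On a much smaller scale than a full dashboard, the sketch below shows how an interactive plot might be produced and written to a standalone HTML file that can be shared. The data, the selection of tools and the file name sine.html are assumptions made for illustration.

```python
import numpy as np
from bokeh.plotting import figure, output_file, save

# Synthetic data: a sine curve sampled at 50 points
x = np.linspace(0, 10, 50)
y = np.sin(x)

# The tools string enables panning, zooming, reset and hover tooltips
p = figure(
    title="Interactive sine curve",
    x_axis_label="x",
    y_axis_label="sin(x)",
    tools="pan,wheel_zoom,box_zoom,reset,hover",
)
p.line(x, y, line_width=2, legend_label="sin(x)")
p.scatter(x, y, size=6)

# Write a self-contained HTML page that opens in any web browser
output_file("sine.html", title="Bokeh example")
save(p)
```

In a Jupyter Notebook, output_notebook() and show(p) from bokeh.plotting can be used instead of output_file() and save(p) to display the plot inline.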

By using bokeh, we can easily share results with colleagues, and work efficiently on shared projects.

References

  1. https://cloudtweaks.com/2015/03/how-much-data-is-produced-every-day/
  2. https://matplotlib.org/
  3. https://seaborn.pydata.org/
  4. https://bokeh.org/