Exploratory Data Analysis

Introduction

The first, and probably the most important, step for a data scientist is to explore and manipulate the data they are about to work with. This is the task of Exploratory Data Analysis, or EDA: extracting insights from the data. Doing so enables an in-depth understanding of the dataset, helps define or discard hypotheses, and provides a solid basis for building predictive models.

In order to perform EDA, a range of data manipulation techniques and statistical tools is available.

In this resource we describe some of the most widely used tools for exploring and manipulating data.

Description of EDA Tools

The first step in EDA usually involves importing the data so that they can be manipulated. The library most users will reach for in this task is Pandas (1).

Pandas

Pandas provides high-level data structures and functions designed to make working with data intuitive and flexible. It emerged in 2008 (with an official release in 2010) from developer Wes McKinney, who created it to manipulate trading data, and it has since helped make Python a powerful and productive data analysis environment. The primary objects in Pandas are the DataFrame, a tabular, column-oriented data structure with both row and column labels, and the Series, a one-dimensional labeled array object. Pandas combines the array-computing ideas of NumPy with the kinds of data manipulation capabilities found in spreadsheets and relational databases such as SQL. It also provides convenient indexing functionality, enabling reshaping, slicing and dicing, aggregation, and selection of data subsets.
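
As a minimal sketch of these two objects, and of basic slicing and aggregation, the snippet below builds a small invented table of trades (the column names and values are illustrative only):

    import pandas as pd

    # In practice the data would usually be imported first, e.g. with pd.read_csv().
    # Here a small table of trades is built by hand; all values are illustrative.
    trades = pd.DataFrame({
        "ticker": ["AAPL", "MSFT", "AAPL", "GOOG"],
        "volume": [100, 250, 300, 150],
        "price": [101.5, 99.2, 101.9, 103.8],
    })

    # A Series is a one-dimensional labeled array; each DataFrame column is one.
    prices = trades["price"]

    # Boolean slicing selects a subset of rows.
    apple_trades = trades[trades["ticker"] == "AAPL"]

    # Aggregation over a data subset: mean price per ticker.
    print(trades.groupby("ticker")["price"].mean())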

In order to manipulate data in a deeper way, data scientists turn to specialized libraries such as NumPy (2) and SciPy (3).

NumPy

NumPy, short for Numerical Python, has long been the pillar of numerical computing in Python. It provides the data structures and algorithms needed for most scientific applications involving numerical data. NumPy contains, among other things:

  • a fast and efficient multidimensional array object;
  • element-wise computations with arrays and mathematical operations between arrays;
  • linear algebra operations, Fourier transforms, and random number generation;
  • a mature C API to enable Python extensions.

One of the primary uses of NumPy in data analysis is as a container for data to be passed between algorithms and libraries. For numerical data, NumPy arrays are more efficient for storage and manipulation than the built-in Python data structures. Thus, many numerical tools for Python either assume NumPy arrays as a primary data structure or target interoperability with NumPy.
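
A short sketch of these capabilities; the matrix values and the random seed are arbitrary choices made for illustration:

    import numpy as np

    # Fast, efficient multidimensional array object.
    a = np.array([[1.0, 2.0], [3.0, 4.0]])

    # Element-wise computation: no explicit Python loop required.
    b = a * 2.0 + 1.0

    # Linear algebra: matrix product and matrix inverse.
    product = a @ a
    inverse = np.linalg.inv(a)

    # Random number generation (the seed is an arbitrary choice).
    rng = np.random.default_rng(seed=0)
    samples = rng.normal(loc=0.0, scale=1.0, size=5)

    print(b, product, inverse, samples, sep="\n")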

SciPy

SciPy, on the other hand, is a collection of packages addressing a number of foundational problems in scientific computing. Its modules include:

  • numerical integration routines and differential equation solvers;
  • linear algebra routines and matrix decompositions (extending beyond NumPy);
  • signal processing tools;
  • sparse matrices and sparse linear system solvers;
  • standard continuous and discrete probability distributions, various statistical tests, and descriptive statistics.
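
The sketch below touches a few of these modules; the integrand, the sample sizes, and the two invented normal samples are chosen purely for illustration:

    import numpy as np
    from scipy import integrate, stats
    from scipy.sparse import csr_matrix

    # Numerical integration: integrate sin(x) over [0, pi] (exact value is 2).
    area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

    # A standard statistical test on two invented normal samples.
    rng = np.random.default_rng(seed=0)
    x = rng.normal(loc=0.0, scale=1.0, size=100)
    y = rng.normal(loc=0.5, scale=1.0, size=100)
    result = stats.ttest_ind(x, y)

    # Sparse matrix: only the nonzero entries are stored.
    sparse_eye = csr_matrix(np.eye(4))

    print(area, result.statistic, result.pvalue, sparse_eye.nnz)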

Together, NumPy and SciPy form a reasonably complete and mature computational foundation for many traditional scientific computing applications.

Scikit-Learn

[Figure 1. Some examples of models included in Scikit-Learn.]

Since its creation in 2007, scikit-learn has become the foremost general-purpose machine learning toolkit for Python programmers (4). It includes submodules for most of the commonly used families of models (a usage sketch follows the list):

  • Classification: SVM, nearest neighbors, random forest, logistic regression, etc.
  • Regression: Lasso, ridge regression, etc.
  • Clustering: k-means, spectral clustering, etc.
  • Dimensionality reduction: PCA, feature selection, matrix factorization, etc.
  • Model selection: Grid search, cross-validation, metrics
  • Preprocessing: Feature extraction, normalization
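
A minimal usage sketch combining several of these submodules, using the bundled iris dataset and an arbitrary grid of regularization strengths:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)

    # Chain preprocessing, dimensionality reduction, and classification.
    pipeline = Pipeline([
        ("scale", StandardScaler()),   # preprocessing: normalization
        ("pca", PCA(n_components=2)),  # dimensionality reduction
        ("clf", LogisticRegression()), # classification
    ])

    # Model selection: grid search over the regularization strength C,
    # scored with 5-fold cross-validation.
    search = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)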

Along with Pandas, scikit-learn has been critical for enabling Python to be a productive data science programming language.

References

  1. Pandas: https://pandas.pydata.org/
  2. NumPy: https://numpy.org/
  3. SciPy: https://scipy.org/
  4. scikit-learn: https://scikit-learn.org/stable/