## Introduction

The first, and arguably the most important, step for a data scientist is to explore and manipulate the data they are about to work with. Exploratory Data Analysis, or EDA, is the process of extracting insights from the data. It enables an in-depth understanding of the dataset, helps confirm or discard hypotheses, and provides a solid basis for building predictive models.

A variety of data manipulation techniques and statistical tools exist for performing EDA.

In this resource we will describe some of the most used tools for exploring and manipulating data.

## Description of EDA Tools

The first step in EDA usually involves importing the data so that it can be manipulated. The library most users will reach for first is Pandas (1).

### Pandas

**Pandas** provides high-level data structures and functions designed to make working with data intuitive and flexible. Pandas emerged in 2008 (with an official release in 2010) from developer Wes McKinney as a tool for manipulating trading data. Since then it has helped make Python a powerful and productive data analysis environment. The primary objects in Pandas are the DataFrame, a tabular, column-oriented data structure with both row and column labels, and the Series, a one-dimensional labeled array object. Pandas combines the array-computing ideas of NumPy with the kinds of data manipulation capabilities found in spreadsheets and relational databases such as SQL. It also provides convenient indexing functionality, enabling reshaping, slicing and dicing, aggregation, and selection of data subsets.
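As a minimal sketch of the DataFrame and Series objects described above (the column names and values here are purely illustrative):

```python
import pandas as pd

# A DataFrame is a tabular, column-labeled structure.
df = pd.DataFrame({
    "city": ["Athens", "Berlin", "Athens", "Madrid"],
    "temp": [31.0, 24.5, 29.8, 33.2],
})

# Selecting one column by label returns a Series,
# a one-dimensional labeled array.
temps = df["temp"]

# Slicing-and-dicing plus aggregation: mean temperature per city,
# in the spirit of a SQL GROUP BY.
mean_by_city = df.groupby("city")["temp"].mean()
print(mean_by_city)
```

The `groupby` call illustrates the relational-database-style capabilities Pandas layers on top of NumPy arrays.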

To manipulate data at a deeper level, data scientists turn to specialized numerical libraries such as NumPy (2) and SciPy (3).

### NumPy

**NumPy**, short for Numerical Python, has long been the
pillar of numerical computing in Python. It provides the data structures
and algorithms needed for most scientific applications involving
numerical data. NumPy contains, among other things: a fast and efficient
multidimensional array object; element-wise computations and
mathematical operations between arrays; linear algebra operations,
Fourier transforms, and random number generation; and a mature C API
for writing Python extensions.

One of the primary uses of NumPy in data analysis is as a container for data to be passed between algorithms and libraries. For numerical data, NumPy arrays are more efficient for storage and manipulation than the built-in Python data structures. Thus, many numerical tools for Python either assume NumPy arrays as a primary data structure or target interoperability with NumPy.
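A short sketch of the NumPy capabilities listed above (the array values are arbitrary examples):

```python
import numpy as np

# The ndarray: a fast, fixed-type multidimensional array object.
a = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Element-wise computation: no explicit Python loop needed.
doubled = a * 2

# Linear algebra: solve the system a @ x = b.
b = np.array([5.0, 11.0])
x = np.linalg.solve(a, b)

# Random number generation with a seeded generator for reproducibility.
rng = np.random.default_rng(0)
sample = rng.normal(size=3)
```

Because operations like `a * 2` and `np.linalg.solve` run in compiled code, they are typically far faster than equivalent pure-Python loops over lists.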

### SciPy

**SciPy**, on the other hand, is a collection of packages
addressing a number of foundational problems in scientific computing.
It contains modules for numerical integration routines and
differential equation solvers; linear algebra routines and matrix
decompositions (extending beyond NumPy); signal processing tools; sparse
matrices and sparse linear system solvers; standard continuous and
discrete probability distributions; and various statistical tests and
descriptive statistics.
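A small sketch touching three of the areas just listed: numerical integration, probability distributions, and a statistical test (the toy data below is illustrative, not from any real study):

```python
import numpy as np
from scipy import integrate, stats

# Numerical integration: the integral of sin(x) from 0 to pi is exactly 2.
area, err = integrate.quad(np.sin, 0, np.pi)

# A standard continuous distribution: P(Z <= 1.96) for a standard normal.
p = stats.norm.cdf(1.96)

# A statistical test: two-sample t-test on synthetic data
# drawn from normals with different means.
rng = np.random.default_rng(42)
t_stat, p_value = stats.ttest_ind(rng.normal(0.0, 1.0, 50),
                                  rng.normal(0.5, 1.0, 50))
```

Each of these comes from a different SciPy subpackage (`scipy.integrate`, `scipy.stats`), reflecting the collection-of-packages design described above.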

Together, NumPy and SciPy form a reasonably complete and mature computational foundation for many traditional scientific computing applications.

### Scikit-Learn

Since its creation in 2007, scikit-learn has become the foremost general-purpose machine learning toolkit for Python programmers (4). It includes submodules covering most common model families and tasks:

- Classification: SVM, nearest neighbors, random forest, logistic regression, etc.
- Regression: Lasso, ridge regression, etc.
- Clustering: k-means, spectral clustering, etc.
- Dimensionality reduction: PCA, feature selection, matrix factorization, etc.
- Model selection: Grid search, cross-validation, metrics
- Preprocessing: Feature extraction, normalization

Along with Pandas, scikit-learn has been critical for enabling Python to be a productive data science programming language.