**Data Science is the science of collecting, storing, processing, describing and modelling data.**

Data science is a combination of the following five activities.

**1. Collecting data**

In many cases, data scientists work with existing data sets that were collected in the course of other investigations. However, the way data is gathered and stored can limit the questions it can answer, and relevant data is not always immediately available. If the required data does not exist, experiments must be designed to collect it.

**Skills required for collecting data:**

i) Intermediate-level programming

ii) Knowledge of databases

iii) Knowledge of Statistics

**2. Storing data**

Simple and structured data (Example: Employee details) can be stored using relational databases.
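As a minimal sketch of storing structured employee details in a relational database, the example below uses Python's built-in `sqlite3` module with an in-memory database. The table name, columns and rows are hypothetical and chosen only for illustration; a production system would more likely use a server-based database.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, department TEXT, salary REAL)"
)

# Hypothetical rows, made up for illustration.
rows = [
    (1, "Asha", "Engineering", 85000.0),
    (2, "Ben", "Sales", 62000.0),
    (3, "Chen", "Engineering", 91000.0),
]
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Structured data can be queried declaratively with SQL.
engineers = conn.execute(
    "SELECT name FROM employee WHERE department = ? ORDER BY name",
    ("Engineering",),
).fetchall()
print(engineers)
conn.close()
```

Because every row follows the same fixed schema, questions such as "who works in Engineering?" reduce to a single query.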

When data is spread across multiple databases, it should be integrated into a common repository, such as a data warehouse, to support analytics. A data warehouse stores structured, curated data from multiple databases in a form that is optimized for analytics.

A large amount of unstructured data (text, images, video and speech) has been generated in the past decade. Such unstructured, uncurated data is stored in data lakes, which can hold both structured and unstructured big data.

**Skills required for storing data:**

i) Programming and Engineering

ii) Knowledge of Relational Databases

iii) Knowledge of NoSQL Databases

iv) Knowledge of Data Warehouses

v) Knowledge of Data Lakes (Hadoop)

**3. Processing data**

Raw data is often messy and contains anomalies, so the data scientist usually has to “wrangle” it before moving on to modelling.

Data cleaning involves filling in missing values, standardizing keywords, correcting spelling errors, and identifying and removing outliers.
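The cleaning steps above can be sketched with the Python standard library alone. The records, the canonical city labels, the choice of the median as the fill value and the 1.5 × IQR outlier rule are all assumptions made for this illustration; in practice a library such as pandas is commonly used, and the right rules depend on the data.

```python
from statistics import median, quantiles

# Hypothetical raw records: city spellings vary and one age is missing.
raw = [
    {"city": "new york", "age": 34},
    {"city": "NYC", "age": None},
    {"city": "New York ", "age": 29},
    {"city": "boston", "age": 41},
    {"city": "Boston", "age": 420},  # likely a data-entry error
    {"city": "Boston", "age": 38},
    {"city": "NYC", "age": 31},
]

# 1. Standardize keywords: map known variants to one canonical label.
CANONICAL = {"new york": "New York", "nyc": "New York", "boston": "Boston"}

def clean_city(value):
    """Trim whitespace, ignore case, and map known variants to one label."""
    text = value.strip()
    return CANONICAL.get(text.lower(), text)

# 2. Fill missing values with the median of the observed ages.
observed = [r["age"] for r in raw if r["age"] is not None]
fill_value = median(observed)

cleaned = [
    {
        "city": clean_city(r["city"]),
        "age": fill_value if r["age"] is None else r["age"],
    }
    for r in raw
]

# 3. Remove outliers: keep ages within 1.5 * IQR of the quartiles.
ages = [r["age"] for r in cleaned]
q1, _, q3 = quantiles(ages, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
kept = [r for r in cleaned if low <= r["age"] <= high]

for record in kept:
    print(record)
```

The median is used for filling because, unlike the mean, it is barely affected by the erroneous value of 420.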

This step, also known as “munging”, is hard to define, but it is where data scientists apply skill and intuition to turn messy, incoherent information into clean, accessible data sets.

Processing data also includes scaling, normalizing, and standardizing.
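These terms are used loosely in practice. The sketch below assumes two common interpretations: min-max normalization, which rescales values to the range [0, 1], and standardization, which converts values to z-scores with zero mean and unit standard deviation. The feature values are hypothetical.

```python
from statistics import mean, pstdev

values = [2.0, 4.0, 6.0, 8.0, 10.0]  # hypothetical feature column

# Min-max normalization: rescale to the range [0, 1].
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Standardization (z-scores): subtract the mean, divide by the
# population standard deviation.
mu, sigma = mean(values), pstdev(values)
standardized = [(v - mu) / sigma for v in values]

print(normalized)
print(standardized)
```

Both transformations assume the column is not constant; otherwise `hi - lo` or `sigma` would be zero and the division would fail.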

**Skills required for processing data:**

i) Programming skills

ii) MapReduce (Hadoop)

iii) SQL and NoSQL databases

iv) Basic Statistics

**4. Describing data**

Describing data includes visualizing it and summarizing it. Summarizing is the domain of descriptive statistics.

Data is typically summarized using the mean, median, mode, standard deviation and variance.
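As a small illustration, Python's built-in `statistics` module provides each of these summaries directly. The sample values are hypothetical, and the variance and standard deviation computed here are the sample versions (dividing by n − 1), which is one of two common conventions.

```python
import statistics

scores = [70, 85, 85, 90, 100]  # hypothetical sample

summary = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "mode": statistics.mode(scores),
    "variance": statistics.variance(scores),  # sample variance (n - 1)
    "std_dev": statistics.stdev(scores),      # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

For the population versions, `statistics.pvariance` and `statistics.pstdev` can be used instead.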

**Skills required for describing data:**

i) Statistics

ii) Excel

iii) Python

iv) R

v) Tableau

**5. Modelling data**

For modelling data, statistical modelling and algorithmic modelling are used.

Statistical modelling is used for simple, intuitive models. It is better suited to low-dimensional data and focuses on interpretability. It draws mainly on statistics.
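Simple linear regression is one example of an interpretable statistical model, chosen here as an illustration rather than prescribed by the text. The sketch below fits it by ordinary least squares on hypothetical data (years of experience against salary in thousands), using only plain Python.

```python
# Hypothetical data: years of experience vs. salary in thousands.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [35.0, 41.0, 44.0, 50.0, 55.0]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Ordinary least squares for a single predictor:
#   slope     = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
#   intercept = y_mean - slope * x_mean
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
sxx = sum((xi - x_mean) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_mean - slope * x_mean

print(f"salary = {intercept:.1f} + {slope:.1f} * years")
```

The model has only two parameters, and each has a direct reading: the slope is the estimated change in salary per additional year of experience.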

Algorithmic modelling is used for complex, flexible models. It can work with high-dimensional data and focuses on prediction. It draws mainly on machine learning and deep learning.

When a large amount of data is available and the relationship between input and output is complex, deep learning is used.
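A k-nearest-neighbours classifier gives a rough feel for the algorithmic approach: it assumes no particular form for the relationship between input and output and simply predicts from nearby examples. The choice of algorithm, the training points and the function name `knn_predict` are all hypothetical; real projects would typically rely on a library such as scikit-learn.

```python
import math
from collections import Counter

# Hypothetical labelled points: (feature vector, class label).
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]

def knn_predict(point, data, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    nearest = sorted(data, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((1.2, 1.4), train))
print(knn_predict((6.8, 6.2), train))
```

Unlike a fitted regression line, this model offers no coefficients to interpret; its value lies in the predictions it makes.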

**Skills required for modelling data:**

i) Inferential Statistics

ii) Probability theory

iii) Calculus

iv) Optimization algorithms

v) Machine learning and Deep learning

vi) Python packages and frameworks (NumPy, SciPy, scikit-learn, TensorFlow, PyTorch, Keras)