Data Science is the science of collecting, storing, processing, describing and modelling data.
Data Science is an assortment of the following activities.
1. Collecting data
In many cases, data scientists work with existing data sets collected in the course of other investigations. However, the way data is gathered and stored can limit the questions that can be answered from it, and relevant data is not always immediately available. If the required data does not exist, experiments must be designed to collect it.
Skills required for collecting data:
i) Intermediate level Programming
ii) Knowledge of databases
iii) Knowledge of Statistics
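As a minimal sketch of working with an existing data set, the snippet below parses a small CSV export using Python's standard csv module. The file contents, column names, and values are made up purely for illustration:

```python
import csv
import io

# Hypothetical CSV export from an earlier investigation
raw = io.StringIO(
    "id,age,income\n"
    "1,34,52000\n"
    "2,29,48000\n"
    "3,41,61000\n"
)

# Existing data is often "collected" simply by parsing such files
# into typed records ready for later processing
reader = csv.DictReader(raw)
records = [
    {"id": int(r["id"]), "age": int(r["age"]), "income": int(r["income"])}
    for r in reader
]
print(records)
```

In practice the same pattern applies when the source is a file on disk, a database query, or an API response: parse, type-convert, and keep the records in a uniform structure.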
2. Storing data
Simple and structured data (for example, employee details) can be stored using relational databases.
When data is spread across multiple databases, it should be integrated into a common repository such as a Data Warehouse to support analytics. A data warehouse stores structured and curated data from multiple databases and is optimized for data analytics.
A large amount of unstructured data (text, images, video and speech) has been generated in the past decade. Such unstructured and uncurated data is stored in Data Lakes, which can hold both structured and unstructured big data.
Skills required for storing data:
i) Programming and Engineering
ii) Knowledge of Relational Databases
iii) Knowledge of NoSQL Databases
iv) Knowledge of Data Warehouses
v) Knowledge of Data Lakes (Hadoop)
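To illustrate how simple, structured employee details fit a relational database, here is a minimal sketch using Python's built-in sqlite3 module. The schema, names, and salaries are hypothetical:

```python
import sqlite3

# In-memory SQLite database standing in for a relational store
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Structured employee details map naturally onto a relational schema
cur.execute("""
    CREATE TABLE employees (
        id     INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        dept   TEXT NOT NULL,
        salary REAL
    )
""")
cur.executemany(
    "INSERT INTO employees (name, dept, salary) VALUES (?, ?, ?)",
    [("Asha", "Engineering", 75000.0),
     ("Ravi", "Sales", 54000.0),
     ("Meena", "Engineering", 82000.0)],
)
conn.commit()

# Analytics-style query: average salary per department
rows = cur.execute(
    "SELECT dept, AVG(salary) FROM employees GROUP BY dept ORDER BY dept"
).fetchall()
print(rows)
```

A data warehouse serves the same kind of aggregate query, but over curated data integrated from many such operational databases.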
3. Processing data
Because raw data is often messy and contains anomalies, the data scientist must "wrangle" it before moving further into the modelling process.
Data cleaning involves filling in missing values, standardizing keywords, correcting spelling errors, and identifying and removing outliers.
Also known as "munging", this hard-to-define step is one of the ways data scientists make the magic happen: bringing skills and intuition to bear to turn messy, incoherent information into clean, accessible data sets.
Processing data also includes scaling, normalizing, and standardizing.
Skills required for processing data:
i) Programming skills
ii) MapReduce (Hadoop)
iii) SQL and NoSQL databases
iv) Basic Statistics
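The cleaning steps above (filling missing values, removing outliers, scaling) can be sketched in plain Python. The sensor readings, the median/MAD outlier rule, and the 5x threshold below are all illustrative assumptions, not a prescribed method:

```python
import statistics

# Hypothetical raw sensor readings: None marks a missing value,
# 999.0 is an obvious recording error (outlier)
raw = [12.0, 14.5, None, 13.2, 999.0, 12.8, None, 13.9]

observed = [x for x in raw if x is not None]

# 1. Flag outliers robustly: points far from the median relative
#    to the median absolute deviation (MAD)
med = statistics.median(observed)
mad = statistics.median([abs(x - med) for x in observed])

def is_outlier(x):
    return abs(x - med) > 5 * mad  # 5x MAD is an arbitrary example cutoff

# 2. Fill missing values with the median of the inlying observations,
#    and drop the flagged outliers
fill = statistics.median([x for x in observed if not is_outlier(x)])
cleaned = [fill if x is None else x
           for x in raw if x is None or not is_outlier(x)]

# 3. Min-max scale the cleaned values into the range [0, 1]
lo, hi = min(cleaned), max(cleaned)
scaled = [(x - lo) / (hi - lo) for x in cleaned]
print(cleaned)
```

The median and MAD are used here (rather than mean and standard deviation) because a single extreme value like 999.0 would otherwise inflate the threshold and escape detection.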
4. Describing data
Describing data includes visualizing and summarizing data, which is the domain of descriptive statistics.
For summarizing data, the mean, median, mode, standard deviation and variance are used.
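These summary statistics can all be computed with Python's standard statistics module; the exam scores below are made-up example data:

```python
import statistics

# Hypothetical exam scores to summarize
scores = [72, 85, 60, 85, 90, 78, 85, 66]

mean = statistics.mean(scores)          # average value
median = statistics.median(scores)      # middle value of the sorted data
mode = statistics.mode(scores)          # most frequent value
stdev = statistics.stdev(scores)        # sample standard deviation
variance = statistics.variance(scores)  # sample variance (stdev squared)

print(f"mean={mean}, median={median}, mode={mode}")
print(f"stdev={stdev:.2f}, variance={variance:.2f}")
```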
Skills required for describing data:
i) Basic Statistics
ii) Data visualization tools
5. Modelling data
For modelling data, statistical modelling and algorithmic modelling are used.
Statistical modelling is used for simple and intuitive models. It is better suited to low-dimensional data and focuses on interpretability. It is closer to statistics.
Algorithmic modelling is used for complex and flexible models. It can work with high-dimensional data and focuses on prediction. It is closer to machine learning and deep learning.
When we have a large amount of data and need to learn complex relationships between inputs and outputs, we use deep learning.
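As a minimal sketch of statistical modelling, the snippet below fits a simple linear model y = a + b*x by ordinary least squares on made-up data. The closed-form estimates keep the model small and interpretable, which is exactly the appeal of the statistical approach:

```python
# Illustrative data: roughly y = 2x with a little noise
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.1, 5.9, 8.2, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form ordinary least squares estimates:
# slope b = cov(x, y) / var(x), intercept a = mean_y - b * mean_x
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

print(f"fitted model: y = {a:.2f} + {b:.2f} * x")
```

An algorithmic model (for example, a neural network trained with scikit-learn or PyTorch) would fit the same data, but with many more parameters and no such directly interpretable coefficients.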
Skills required for modelling data:
i) Inferential Statistics
ii) Probability theory
iii) Optimization algorithms
iv) Machine learning and Deep learning
v) Python packages and frameworks (numpy, scipy, scikit-learn, TF, PyTorch, Keras)