In recent times we very often hear terms like Data Science, Machine Learning and Deep Learning. But what is the link between them? Are they interconnected, or is one of them a subset of another? Such questions often cause confusion, leading to wrong conclusions and myths.
What is Data Science?
- It is the science of collecting, storing, processing, describing and modelling data. These are the 5 main keywords that constitute Data Science.
A. Collecting Data:
Data can be obtained in 3 ways:
- What to collect? Eg: Take a shopping app like Amazon: depending on the products you buy, it stores that data and uses it to recommend further products. So here you provide the data.
- Where to collect? Eg: Take a political party: for the members of the organization to know how their party is perceived, they collect data from third parties by conducting a survey. So here you fetch the data.
- How to collect? Sometimes the data is not available, or conclusions cannot be drawn from it, so we create data by experimenting. Eg: Take a plot of land: we do not know which crop will yield the maximum profit, so we divide the land into parts and try different seeds, fertilizers, pesticides etc. So here we create data by trying out permutations and combinations.
B. Storing data:
After the data is collected, it needs to be stored so it can be accessed later. Data is stored in 3 ways:
- Databases: Only structured data is stored here; it is optimized for SQL queries.
- Data Warehouses: Data from various structured databases gets accumulated and integrated into a common repository. It is curated and optimized for analysis.
- Data Lakes: Here is where the term Big Data intersects Data Science. Both structured and unstructured data get stored here. This is the era of Big Data, which comes with the 3 V's: Volume, Variety and Velocity. The data is uncurated.
C. Processing Data:
Processing data involves 3 steps:
Data Wrangling or Data Munging:
It is the process of Extracting, Transforming and Loading (ETL) data, integrating data from multiple sources into one place.
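The extract-transform-load idea can be sketched in a few lines of plain Python. This is a minimal illustration with made-up source data, not a real ETL pipeline: two hypothetical sources use different field names, casing and types, and the transform step unifies them into one schema.

```python
# Minimal ETL sketch: integrate customer records from two hypothetical sources.

# Extract: raw records, each source with its own conventions.
crm_rows = [{"name": "Alice ", "spend": "120.50"}, {"name": "bob", "spend": "80"}]
web_rows = [{"user": "CAROL", "total_usd": 42.0}]

def transform(rows, name_key, spend_key):
    """Normalize field names, casing and types into one common schema."""
    return [
        {"name": r[name_key].strip().title(), "spend": float(r[spend_key])}
        for r in rows
    ]

# Transform + Load: merge both sources into a single curated list.
warehouse = transform(crm_rows, "name", "spend") + transform(web_rows, "user", "total_usd")
print(warehouse)
```

Real pipelines would load into a warehouse table rather than a Python list, but the extract/transform/load split is the same.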
Data Cleaning: involves
- Filling missing values
- Standardizing keyword tags
- Correcting spelling errors
- Identifying and removing outliers
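Two of these steps can be shown concretely. The sketch below, on illustrative values, fills a missing reading with the median and drops outliers using a median-absolute-deviation (MAD) rule; the 5-MAD cutoff is an arbitrary choice for this example.

```python
import statistics

# Toy data-cleaning pass: None marks a missing reading, 1000 is an obvious outlier.
data = [10, 12, None, 11, 13, 1000]

# 1. Fill missing values with the median of the observed values.
observed = [x for x in data if x is not None]
filled = [statistics.median(observed) if x is None else x for x in data]

# 2. Remove outliers: drop any point more than 5 MADs from the median.
med = statistics.median(filled)
mad = statistics.median(abs(x - med) for x in filled)
cleaned = [x for x in filled if abs(x - med) <= 5 * mad]
print(cleaned)  # [10, 12, 12, 11, 13]
```

The median and MAD are used instead of the mean and standard deviation because extreme outliers distort the latter, which can mask the very points you are trying to remove.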
Data Scaling, Normalization and Standardization:
Normalization typically means re-scaling the values into a range of [0,1]. Standardization means re-scaling data to have a mean of 0 and a standard deviation of 1 (unit variance).
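Both transformations are a couple of lines each. A minimal sketch on made-up values, using only the standard library:

```python
import statistics

values = [2.0, 4.0, 6.0, 8.0, 10.0]  # illustrative data

# Normalization (min-max): rescale into the range [0, 1].
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Standardization (z-score): rescale to mean 0 and standard deviation 1.
mu = statistics.mean(values)
sigma = statistics.pstdev(values)  # population standard deviation
standardized = [(v - mu) / sigma for v in values]

print(normalized)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Libraries such as scikit-learn provide the same operations (e.g. min-max and standard scalers), but the arithmetic is exactly what is shown here.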
D. Describing Data:
The data can be described in two ways:
- Data Visualization: by using visual elements like charts, graphs and maps, data visualization tools provide an accessible way to see and understand trends, outliers and patterns in data.
- Descriptive Statistics: by using statistical measures like mean, median, mode, variance and standard deviation.
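Python's standard `statistics` module computes all of these measures directly. A quick example on an illustrative list of exam scores:

```python
import statistics

scores = [70, 75, 75, 80, 100]  # illustrative exam scores

print(statistics.mean(scores))       # 80  (average value)
print(statistics.median(scores))     # 75  (middle value when sorted)
print(statistics.mode(scores))       # 75  (most frequent value)
print(statistics.pvariance(scores))  # 110 (population variance)
print(statistics.pstdev(scores))     # ~10.49 (population standard deviation)
```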
E. Modelling Data:
There are 2 types of modelling:
Statistical Modelling:
- It aims to make robust guarantees
- It models the underlying data distribution
- It models the underlying relations in the data
- It formulates and tests hypotheses
- It gives statistical guarantees (probability values, goodness-of-fit tests)
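A classic statistical model is simple linear regression: it assumes an explicit, interpretable form y = a + b·x and fits a and b by least squares, so the coefficients themselves describe the relation in the data. A minimal sketch on made-up data:

```python
# Simple linear regression (least squares) on made-up data: y ≈ a + b*x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]  # roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares estimates for the slope and intercept.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

print(round(a, 2), round(b, 2))  # 0.09 1.99
```

The fitted slope of about 2 is directly interpretable ("each unit of x adds about 2 to y"), which is exactly the focus-on-interpretability trait listed above.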
Algorithmic Modelling / Machine Learning:
It allows you to choose a very complex relation/function to express the relationship between the variables in your data, as is the case in many real-world applications, with the focus on prediction. This is where terms like machine learning and deep learning come into play.
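One of the simplest algorithmic models is the nearest-neighbour classifier: it assumes no formula at all, the "model" is just the stored data, and the only goal is prediction. A minimal 1-nearest-neighbour sketch on made-up points:

```python
# 1-nearest-neighbour classifier: predict a label by copying the
# label of the closest stored point. No assumed functional form.
points = [((1.0, 1.0), "red"), ((1.2, 0.8), "red"),
          ((4.0, 4.2), "blue"), ((3.8, 4.0), "blue")]

def predict(query):
    """Return the label of the stored point nearest to the query."""
    def dist2(p):
        # Squared Euclidean distance (square root not needed for comparison).
        return (p[0] - query[0]) ** 2 + (p[1] - query[1]) ** 2
    return min(points, key=lambda item: dist2(item[0]))[1]

print(predict((1.1, 0.9)))  # red
print(predict((4.1, 4.1)))  # blue
```

Unlike the regression example, there are no coefficients to interpret here; the model can capture arbitrarily complex decision boundaries, but it only answers "what is the prediction?", not "why?".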
| Statistical Modelling | Algorithmic Modelling |
| --- | --- |
| Simple, intuitive models | Complex, flexible models |
| More suited for low-dimensional data | Can work with high-dimensional data |
| Robust statistical analysis is possible | Not suitable for robust statistical analysis |
| Focus on interpretability | Focus on prediction |
“When you have a large amount of high-dimensional data and you want to learn very complex relationships between the output and the input, we use a specific class of complex ML models and algorithms, collectively referred to as deep learning.”
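The core idea behind those complex relationships is composing layers of simple units. A hand-wired two-layer network computing XOR illustrates this: XOR cannot be represented by any single linear unit, but two hidden units feeding an output unit capture it. The weights below are chosen by hand for clarity, not learned from data as they would be in real deep learning.

```python
def step(z):
    """Threshold activation: the unit fires (1) when its weighted sum is positive."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    # Hidden layer: an OR-like unit and an AND-like unit.
    h1 = step(x1 + x2 - 0.5)  # fires if at least one input is 1
    h2 = step(x1 + x2 - 1.5)  # fires only if both inputs are 1
    # Output layer combines them: "OR but not AND" = XOR.
    return step(h1 - h2 - 0.5)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", xor_net(a, b))  # 0, 1, 1, 0
```

Deep networks scale this same composition idea to many layers and millions of learned weights.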