Understanding Data Science

These days we hear terms like Data Science, Machine Learning and Deep Learning everywhere. But what is the link between them? Are they interconnected, or is one of them a subset of another? Questions like these cause confusion and lead to wrong conclusions and myths.

What is Data Science?

Data Science:

  • It is the science of collecting, storing, processing, describing and modelling data. These 5 keywords are what constitute Data Science.

A. Collecting Data:

Data can be obtained in 3 ways:

  1. What to collect? Eg: Take a shopping app like Amazon. It stores data about the products you buy and uses it to recommend further products. So here you provide the data to them.
  2. Where to collect? Eg: Take a political party. To learn how people view the party, its members collect data from third parties by running a survey. So here you fetch the data.
  3. How to collect? Sometimes the data is not available, or no conclusions can be drawn from what exists, so we create data by experimenting. Eg: Take a piece of land. We do not know which crop will give the maximum profit, so we divide the land into parts and try different seeds, fertilizers, pesticides etc. So here we create data by trying different combinations.
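The land experiment in item 3 can be sketched in a few lines of Python. This is only an illustration: the factor names and levels below are made up, and a real field trial would also need replication and randomization.

```python
from itertools import product

# Hypothetical factor levels; these names are illustrative, not from the post.
seeds = ["wheat", "maize"]
fertilizers = ["organic", "chemical"]
pesticides = ["none", "standard"]


def design_plots(*factors):
    """Return every combination of the given factor levels, one per plot."""
    return list(product(*factors))


plots = design_plots(seeds, fertilizers, pesticides)
for plot_id, (seed, fertilizer, pesticide) in enumerate(plots, start=1):
    print(f"plot {plot_id}: seed={seed}, fertilizer={fertilizer}, pesticide={pesticide}")
```

Each printed line is one part of the land with its own treatment. The yield measured on each part is the data the experiment creates.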

B. Storing data:

After the data has been collected, it needs to be stored so that it can be accessed later. Data is stored in 3 ways:

  1. Relational Databases:

    Only structured data is stored here. It is optimized for SQL queries.

  2. Data Warehouses:

    Data from various structured databases is accumulated and integrated in a common repository. It is curated and optimized for analysis.

  3. Data Lakes:

    This is where the term Big Data intersects Data Science. Both structured and unstructured data are stored here. This is the era of Big Data, which comes with the 3 V's: Volume, Variety and Velocity. The data here is uncurated.
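To make the relational database option (item 1) concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, columns and rows are invented for illustration. The point is only that structured rows can be stored and then queried with SQL.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: every row has the same fixed, typed columns.
conn.execute("CREATE TABLE purchases (customer TEXT, product TEXT, price REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [("asha", "book", 12.5), ("asha", "lamp", 30.0), ("ravi", "book", 12.5)],
)

# Structured rows can be aggregated directly with a SQL query.
totals = conn.execute(
    "SELECT customer, SUM(price) FROM purchases GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
conn.close()
```

A data warehouse or a data lake would sit behind different tooling, so this sketch covers only the first of the three storage options.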

C. Processing Data:

Processing of data involves 3 steps:

  1. Data Wrangling or Data Munging:

    It is the process of extracting, transforming and loading data. It integrates data from multiple sources.

  2. Data Cleaning:

    It involves:

    • Filling missing values
    • Standardizing keyword tags
    • Correcting spelling errors
    • Identifying and removing outliers

  3. Data Scaling, normalization and standardization:

    Normalization typically means re-scaling the values into a range of [0,1]. Standardization means re-scaling data to have a mean of 0 and a standard deviation of 1 (unit variance).
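The cleaning and scaling steps above can be sketched with the Python standard library alone. The sample values are made up, filling gaps with the median is just one common choice, and the helpers assume the values are not all identical (otherwise the divisions below would fail).

```python
from statistics import mean, median, pstdev


def fill_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def normalize(values):
    """Min-max scaling: re-scale the values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Re-scale to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


raw = [2.0, None, 4.0, 6.0, 8.0]   # hypothetical measurements with one gap
clean = fill_missing(raw)
scaled = normalize(clean)
z = standardize(clean)

print(clean)
print(scaled)
print(z)
```

Normalization keeps the smallest value at 0 and the largest at 1, while standardization centres the values on 0 and measures each one in standard deviations.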

D. Describing Data:

The data can be described in two ways:

  1. Visualizing Data:

    By using visual elements like charts, graphs and maps, data visualization tools provide an accessible way to see and understand trends, outliers and patterns in data.

  2. Summarizing Data:

    Statistical measures like mean, median, mode, variance and standard deviation describe the data with a few numbers.
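As a small illustration of summarizing, the measures listed above can be computed with Python's statistics module. The scores are made-up sample values, and the population versions of variance and standard deviation are used here as an assumption.

```python
from statistics import mean, median, mode, pstdev, pvariance

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample

summary = {
    "mean": mean(scores),
    "median": median(scores),
    "mode": mode(scores),
    "variance": pvariance(scores),   # population variance
    "std_dev": pstdev(scores),       # population standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

For a sample drawn from a larger population, `variance` and `stdev` from the same module would be the usual replacements.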

E. Modelling Data:

There are 2 types of modelling:

  1. Statistical Modelling:

    • It aims to make robust guarantees
    • It models the underlying data distribution
    • It models the underlying relations in the data
    • It formulates and tests hypotheses
    • It gives statistical guarantees (probability values, goodness-of-fit tests)

  2. Algorithmic Modelling / Machine Learning:

    It lets you choose a very complex function to express the relation between the variables in your data, which is what many real-world applications need. The focus is on prediction. This is where terms like machine learning and deep learning come into play.
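To give a feel for the statistical side of this contrast, here is a minimal sketch of a simple, interpretable model: a least-squares line with R² as a goodness-of-fit measure. The data points are invented. Note that R² is a descriptive measure rather than a formal goodness-of-fit test, so this only loosely illustrates the "statistical guarantee" idea.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys that the fitted line explains."""
    y_bar = mean(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


xs = [1, 2, 3, 4, 5]                 # hypothetical input values
ys = [2.1, 3.9, 6.2, 7.8, 10.1]      # hypothetical observed outputs

slope, intercept = fit_line(xs, ys)
r2 = r_squared(xs, ys, slope, intercept)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r2:.4f}")
```

The two fitted numbers can be read directly (how much y changes per unit of x), which is the interpretability that complex, flexible models trade away for predictive power.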

Statistical Modelling | Algorithmic Modelling / Machine Learning
Simple, intuitive models | Complex, flexible models
More suited for low-dimensional data | Can work with high-dimensional data
Robust statistical analysis is possible | Not suitable for robust statistical analysis
Focus on interpretability | Focus on prediction

“When you have a large amount of high-dimensional data and you want to learn very complex relationships between the output and input, we use a specific class of complex ML models and algorithms, collectively referred to as deep learning.”

The 5 main keywords that constitute Data Science are missing from your post.
It's all mixed-up content.

Hey, thanks for reminding me. I did a copy-paste from my document and forgot to edit a few things here. Please let me know if there are any mistakes.

Mean, median, mode and standard deviation are basic concepts of statistics. They do not summarize your data.
The normal, Poisson and binomial distributions are used to analyze your data set.
Basic statistical concepts also apply together with distribution concepts.