What is Data Science?
Data Science is the science of collecting, storing, processing, describing and modelling data.
So to understand Data Science, we should know what these tasks are:
- Collect data
- Store data
- Process data
- Describe data
- Model data
A person need not be involved in all of these tasks and may work on only some of them. So a person does not have to do every task to be called a Data Scientist.
1. Collecting Data:
- Depends on the problem we are trying to solve.
- Depends on the environment in which the data scientist is working.
Different ways of collecting data:
- Data may not be readily available, so we have to crawl the web or pull it from APIs (some of them paid) such as those of Facebook or Twitter, for example to gather people's political opinions.
- Data may already be available inside data-rich organizations like Amazon or Google. We just need to write code that queries a database or reads JSON files and filters the results.
- Data may not exist at all, so we have to design experiments to generate it. Such experiments can take months to complete, as in farming.
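As a rough illustration of the second case (reading JSON and filtering it), here is a minimal Python sketch. The records, the field names `user`, `topic` and `text`, and the helper `filter_by_topic` are all hypothetical; a real source would be a file or an API response with its own schema.

```python
import json

# Hypothetical records standing in for a JSON file or an API response.
raw = """
[
  {"user": "a", "topic": "politics", "text": "I support the new policy"},
  {"user": "b", "topic": "sports", "text": "Great match yesterday"},
  {"user": "c", "topic": "politics", "text": "The policy needs more debate"}
]
"""

def filter_by_topic(records, topic):
    """Keep only the records whose 'topic' field equals the given topic."""
    return [r for r in records if r.get("topic") == topic]

records = json.loads(raw)                      # parse the JSON text into Python dicts
political = filter_by_topic(records, "politics")
print([r["user"] for r in political])
```

Reading from a file instead would only change the parsing step to `json.load(open_file)`; the filtering stays the same.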
2. Storing Data:
Today data is abundant and comes in many forms and formats, so it needs to be stored efficiently. This need led to Data Warehouses.
Data warehouses are a common repository into which data from multiple structured databases is fed. Data warehouses store structured data only and are optimised for analytics.
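To make "structured data optimised for analytics" concrete, here is a small sketch that uses an in-memory SQLite database as a stand-in for a warehouse. SQLite is not a data warehouse, and the `sales` table with its columns is a made-up example; the point is only that a fixed schema makes aggregate queries straightforward.

```python
import sqlite3

# In-memory database used as a toy stand-in for a warehouse (an assumption).
conn = sqlite3.connect(":memory:")

# Structured data: every row follows the same fixed schema.
conn.execute("CREATE TABLE sales (product TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("pen", "north", 10.0), ("pen", "south", 15.0), ("book", "north", 40.0)],
)

# A typical analytical query: total sales per product.
totals = conn.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product ORDER BY product"
).fetchall()
print(totals)
conn.close()
```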
But today most high-volume data is unstructured, in the form of speech, text, video, audio etc., and the data is generated at:
- High volume
- High variety
- High velocity
Data lakes: A data lake is a collection of all sorts of data that flows in from within and outside the organization. Unlike data warehouses, data lakes can hold unstructured data.
3. Processing Data:
Data Processing involves multiple tasks:
- Data wrangling and data munging: extracting the data, transforming it (for example by adding labels) and loading it in a particular format such as JSON.
- Filling missing values
- Standardising keyword tags
- Correcting spelling errors
- Identifying and removing outliers
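Some of the cleaning steps above can be sketched with the Python standard library. The sample values, the alias table, the choice of the median as fill value and the 1.5 x IQR outlier rule are all assumptions made for illustration; the right choices depend on the data and the domain.

```python
from statistics import median, quantiles

# Hypothetical raw data: None marks a missing value, 120 is a suspicious age.
ages = [23, 25, None, 24, 26, 120, 22]
tags = ["NYC", "new york", "New York ", "nyc"]

def fill_missing(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def standardise_tag(tag, aliases):
    """Lower-case, collapse whitespace, then map known aliases to one spelling."""
    key = " ".join(tag.lower().split())
    return aliases.get(key, key)

def remove_outliers(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if low <= v <= high]

filled = fill_missing(ages)
cleaned = remove_outliers(filled)
cleaned_tags = [standardise_tag(t, {"nyc": "new york"}) for t in tags]
print(filled, cleaned, cleaned_tags)
```

Note that the order matters: missing values are filled before outlier removal here, so the outlier still influences the fill value. The median is used because it is less sensitive to such values than the mean.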
Data scaling, normalising, standardising:
- Scaling: convert between units, e.g. km to miles or rupees to dollars
- Normalise: rescale all values to lie between 0 and 1, using the min-max formula (x - min) / (max - min)
- Standardise: rescale to zero mean and unit variance, using the formula (x - mean) / standard deviation
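The three rescalings above can be written in a few lines of Python. This is a sketch: the function names `km_to_miles`, `min_max` and `zscore` are my own, the conversion factor is approximate, and both rescaling functions assume the values are not all identical (otherwise they divide by zero).

```python
from statistics import mean, pstdev

def km_to_miles(km):
    """Unit scaling: multiply by a fixed conversion factor (approximate)."""
    return km * 0.621371

def min_max(values):
    """Map values linearly so the smallest becomes 0 and the largest becomes 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def zscore(values):
    """Shift and scale values to zero mean and unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

distances_km = [1.0, 2.0, 3.0, 4.0, 5.0]
print([km_to_miles(d) for d in distances_km])
print(min_max(distances_km))
print(zscore(distances_km))
```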
But when the data is too large to be processed on a single machine, we should apply distributed processing, in which we divide the data into small chunks and process the chunks in parallel. Frameworks like Hadoop use MapReduce to deal efficiently with large amounts of data.
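The idea behind MapReduce can be illustrated with a word count. The sketch below runs in a single Python process and is not how Hadoop is actually programmed; the helper names and the sample lines are made up. It only shows the pattern: split the data into chunks, compute a partial result per chunk (map), then merge the partial results (reduce).

```python
from collections import Counter
from functools import reduce

def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def map_chunk(lines):
    """Map step: count the words within one chunk only."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def reduce_counts(a, b):
    """Reduce step: merge two partial word counts."""
    return a + b

lines = ["big data is big", "data lakes hold data", "big models need data"]
partials = [map_chunk(chunk) for chunk in chunked(lines, 2)]
total = reduce(reduce_counts, partials, Counter())
print(total.most_common(2))
```

In a real distributed system each call to the map step would run on a different machine, which is possible because the chunks do not depend on each other.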
4. Describing Data:
Describing data involves:
Visualising Data, using:
- Graphs, bar charts and pie charts, which give a clear graphical picture of the data, for example to compare the sales of our products.
Summarising data, by:
- Finding mean, median, mode etc
- Standard deviation, Variance
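These summary statistics are available in Python's standard `statistics` module. The sales figures below are made up, and the population versions (`pvariance`, `pstdev`) are used on the assumption that the list is the whole data set; for a sample, `variance` and `stdev` would be used instead.

```python
from statistics import mean, median, mode, pstdev, pvariance

# Hypothetical daily sales of one product.
sales = [12, 15, 15, 18, 20]

summary = {
    "mean": mean(sales),          # average value
    "median": median(sales),      # middle value after sorting
    "mode": mode(sales),          # most frequent value
    "variance": pvariance(sales), # average squared distance from the mean
    "std_dev": pstdev(sales),     # square root of the variance
}
print(summary)
```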
5. Modelling Data:
Modelling means finding a function that describes the data, such as the relation between input and output. The function can be as simple as a linear equation or as complex as a quadratic, higher-degree polynomial or trigonometric (sine, tan) function.
Modelling can be divided into two classes. The first is statistical modelling, which involves:
- Modelling underlying data distribution
- Modelling underlying relations in data
- Formulating and testing hypotheses
- Giving statistical guarantees (p-values, goodness-of-fit tests)
Statistical models are simple, intuitive models suited to low-dimensional data, but they allow robust statistical analysis.
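As one example of testing a hypothesis and obtaining a p-value, here is a sketch of an exact two-sided permutation test for a difference in means. The permutation test is my choice of illustration, not a method named in the text, and the two groups are made-up numbers. The null hypothesis is that the group labels do not matter, so every way of splitting the pooled values into two groups is equally likely.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(a, b):
    """Exact two-sided permutation test for a difference in group means."""
    observed = abs(mean(a) - mean(b))
    pooled = a + b
    n = len(pooled)
    extreme = 0
    total = 0
    # Enumerate every way of choosing which pooled values form the first group.
    for idx in combinations(range(n), len(a)):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(n) if i not in chosen]
        total += 1
        # Count splits at least as extreme as the one actually observed.
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total

control = [10, 11, 12, 13]
treated = [20, 21, 22, 23]
p = permutation_p_value(control, treated)
print(p)
```

Full enumeration is only feasible for small groups, which fits the low-dimensional, small-data setting described above; for larger groups a random sample of splits is commonly used instead.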
The second class is about finding the relation between input and output, i.e. Y = f(x).
Here f can be any function. In real-world data, the function can be very complex. The ultimate goal is to estimate the function f using data and optimization techniques. Models of this kind have these characteristics:
- Complex Flexible models
- Can work with high dimensional data
- Not suitable for robust statistical analysis
- Focus is on prediction.
- Data hungry models
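Estimating f from data with an optimization technique can be shown on the simplest case, a straight line. The sketch below assumes f(x) = w*x + b and fits w and b by gradient descent on the mean squared error. The data, the learning rate and the step count are illustrative choices, and real models of this class are far more complex.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Estimate w and b in y = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of (1/n) * sum((w*x + b - y)^2) with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Move the parameters a small step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up data generated from y = 2x + 1, so the fit should recover w near 2, b near 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

The human chooses the form of f, the loss and the learning rate; the loop only finds the parameter values.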
Myths of Data Science:
- The machine does everything. (Let's debunk this myth.)
- What to collect? -> Programmer job
- Where to collect? -> Programmer job
- How to collect data? -> Programmer job (by experimenting etc.)
- Labelling data? -> Programmer Job
- Executing Scripts? -> Machine Job (Processing long complex jobs)
- What schema? -> Programmer Designs
- Which file system? -> Programmer Decides (but machine provides the system resources like storage)
- Wrangling and munging data? -> Programmer job (requires domain knowledge)
- What data to clean? -> Programmer decides
- How to clean? -> Programmer decides, using statistics to know what to clean
- Study and integrate? -> Programmer job
- Multiple formats? -> Programmer decides which format to work with
- Executing scripts to process large amounts of data? -> Machine job
- Which columns? -> Programmer decides which columns hold usable data
- Which plots? -> Programmer decides, since plots must be human-readable
- Study trends? -> Programmer decides which trends to study, with the machine's help
- Executing scripts over large amounts of data? -> Machine job
- Hypothesising, proposing models and overseeing training? -> Programmer job
- Estimating parameters? -> Machine job (parameters are learnt by optimising with some learning algorithm)
Only a very small part of the Data Science job is automated by machines; the rest of collecting, storing, curating, describing and modelling data is purely a human job.
- Collecting data: