Data Science Demystified-Preet

What is data science? Formally, data science is described as the science of collecting, pre-processing, describing and modelling data in order to gain insight from the data or test a hypothesis. In other words, data science is simply the process of using data to gain insight and make better decisions. A good example is the World War 2 bomber analysis. By examining the Allied bombers that came back from bombing Germany and recording where the returning bombers had been damaged by gunfire, the analysts found that most bombers were lost to fire to the cockpit, tail and engines: planes hit there rarely made it back, so the returning planes showed little damage in those areas. These areas were given increased armour to improve survivability.

For more, click on the link:

Collecting data is the most important part of data science; as they say, garbage in, garbage out. If the data collected is not an accurate representation of the real-world problem you plan to solve, then the conclusions drawn from the data will be wrong. Data must also be collected in an ethical manner, and we should inform people of the purpose for which we are collecting it.

Data pre-processing involves getting rid of empty data and wrong data and, if required, applying normalization, standardization, etc. If required, we must also eliminate redundant data and, where possible, create new features that represent multiple existing features, so that we reduce the amount of data we need to process.
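As a rough illustration of these steps, here is a minimal sketch in plain Python. The function names, the sample ages and the valid range of 0 to 120 are all made up for the example, and a real project would more likely use a library built for this purpose.

```python
from statistics import mean, pstdev

def drop_bad_values(values, lowest, highest):
    """Remove empty entries (None) and entries outside [lowest, highest]."""
    return [v for v in values if v is not None and lowest <= v <= highest]

def min_max_normalize(values):
    """Normalization: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values identical, nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: rescale to zero mean and unit (population) std deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical ages with one empty entry and one impossible value (-5).
raw_ages = [20.0, None, 30.0, -5.0, 40.0, 60.0]

clean = drop_bad_values(raw_ages, lowest=0.0, highest=120.0)
normalized = min_max_normalize(clean)
standardized = standardize(clean)

print(clean)
print(normalized)
print(standardized)
```

Normalization keeps the values between 0 and 1, while standardization centres them on zero; which one is appropriate depends on the model that will use the data.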

By describing data we mean giving the frequency, mean, median and mode of the data. Data visualization is the most important part of it. For good data visualization we must ensure that, whenever possible, the data is scaled; sometimes, however, such as when showing a person's fever, we must not scale the data, because the actual temperature is what matters. The plot should be easy to understand, and the data displayed should be easy to digest. A good example of data visualization is the famous graph of Napoleon's invasion of Russia:


This graph shows the number of troops in the width of the line (black for the retreat and pink for the invasion). Important geographical locations are shown in order and are linked to the temperature. The graph is multidimensional and shows how the temperature affected the invasion of Russia.

A bad example would be the following graph:

Here the problem is that the people who plotted the graph decided to show both relative and absolute values. In addition, the graph plots a white bar on a white background, which is hardly visible; there is no scale for the absolute values; and the relative values are simply written on the plot rather than drawn to scale. In the effort to make the graph multidimensional, it has become difficult to understand.
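Before any plotting, the basic description of the data mentioned earlier (frequency, mean, median, mode) can be computed in a few lines. This is only a sketch using Python's standard library, and the list of scores is invented for the example.

```python
from collections import Counter
from statistics import mean, median, mode

# Hypothetical data: quiz scores for seven students.
scores = [2, 3, 3, 5, 7, 3, 5]

frequency = Counter(scores)      # how many times each score occurs
summary = {
    "mean": mean(scores),        # arithmetic average
    "median": median(scores),    # middle value of the sorted data
    "mode": mode(scores),        # most frequent value
}

print(frequency)
print(summary)
```

Here the mean (4) is higher than the median and mode (both 3), which already hints that a few high scores are pulling the average up.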

We must check whether the insights we got from the data are statistically significant and accurate. For prediction, we must check the model on unseen test data to see if it performs as intended, and it should be deployed only when it passes the test. In terms of modelling there are mainly two kinds of model: statistical modelling, which is used when less data is available or when the relationships are simple, and algorithmic modelling, which is used when there is more data or when the data is more complex. In either case, once the model has passed the unseen-data test, it is ready to be deployed for testing in real-world situations.
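The idea of testing on unseen data can be sketched as follows. Everything here is an assumption made for illustration: the data is synthetic, the model is a simple straight-line fit, and the acceptance threshold MAX_ACCEPTABLE_ERROR is an arbitrary number that a real project would choose for itself.

```python
import random
from statistics import mean

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold back a fraction as unseen test data."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(rows):
    """Least-squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def mean_absolute_error(model, rows):
    """Average absolute difference between predictions and true values."""
    slope, intercept = model
    return mean(abs(y - (slope * x + intercept)) for x, y in rows)

# Synthetic data: y is roughly 2x + 1, with a little random noise.
noise = random.Random(42)
data = [(x, 2 * x + 1 + noise.uniform(-0.5, 0.5)) for x in range(40)]

train, test = train_test_split(data)
model = fit_line(train)                         # the model only sees training data
test_error = mean_absolute_error(model, test)   # judged on data it has not seen

MAX_ACCEPTABLE_ERROR = 1.0                      # hypothetical acceptance threshold
ready_to_deploy = test_error <= MAX_ACCEPTABLE_ERROR
print(model, test_error, ready_to_deploy)
```

The important design choice is that the test rows are never used for fitting, so the test error is an honest estimate of how the model behaves on new data.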

So let’s enjoy our journey in data science.