What is Data Science?
We all know data is found everywhere. Data is there in all forms too – a video is data, a small pamphlet is also data. So, in the age of data abundance, it is only natural to attempt to make meaningful use of it. Data science is THAT attempt.
As data science has grown in popularity over the last two decades, so have the chances of it being misunderstood. In the 90s, as the internet spread, everyone talked about it but not everyone clearly understood what it was. This blog is an attempt to simplify the concept so that it can be understood not only at a technical and broader level but also at a more personal level.
Who is Data Science for?
I am a middle-aged investment banker: can I use it to minimise risk for my clients? I am a school teacher: can I use it to understand and predict the performance of my students? I head a business unit (BU) in an IT company: how can I use it to enhance quality and control errors in my product release cycles? I own a restaurant: how will it help my business? I am a home-maker: how will it help me run my home? The answer is that it can serve each and every one of the domains mentioned above.
Let us explore how each one can benefit from it with a simple example. Through this example, we can also understand the various stages involved in data science.
Data Science in simple use
Latha runs her home. Over the last 5-6 months she has noticed that the savings left at the end of each month from her and her husband Shyam's combined salaries are consistently declining. For the first couple of months she did not pay attention, but now that the decline has become consistent, she wants to take stock and address the situation. Here's her plan of action:
1. Collect Data
Latha and Shyam collect all the bills and receipts they can find for expenses incurred in the last 6 months. In the process, they realise that their expenses can be broadly grouped into the following buckets:
Monthly rent, Car loan, Groceries, Fuel, Electricity, Cellphone, Internet, DTH bills, Medical, Restaurant, Shopping, Outings.
So they gather whatever bills they can, month by month.
How is this done in business? At a larger scale, businesses and organisations collect data through systematic customer feedback surveys, from customer transactions already stored in their databases, or by buying data from third-party sources like Facebook, Twitter etc. For example, an online food portal like Zomato would already have most of this data in its database (DB) through customer orders.
2. Store Data
While collecting the bills, they find that the bills and receipts come in various incompatible forms.
Some are bank transactions, some are printed bills from, say, grocery shops or hospitals, some are Paytm screenshots, some are credit card bills, and some are cash transactions with no bill at all, such as vegetables bought from the local vegetable vendor.
As the data is heterogeneous in nature, Latha decides to convert all of it into one standard form: an Excel sheet with a month-wise list of expenses.
What is the corporate equivalent of this step? At a larger scale, companies would store this data in databases (SQL or NoSQL), in more sophisticated infrastructure like data warehouses and data lakes, or in the cloud (Dropbox, iCloud etc.), and share it in real time (Google Docs).
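The standardisation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three sources, their field names, the dates, the amounts and the `BUCKETS` mapping are all hypothetical, and the "one standard form" is written as CSV text rather than an actual Excel file.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw records: every field name, date and amount is invented.
bank_rows = [{"date": "2021-01-05", "narration": "House rent", "debit": "15000.00"}]
wallet_rows = [{"paid_on": "12/01/2021", "to": "Grocery store", "amount": "1,250"}]
cash_notes = [("2021-01", "Vegetables", 600)]  # no bill, written down from memory

# Hypothetical mapping from a payee or item to one of the expense buckets.
BUCKETS = {"House rent": "Rent", "Grocery store": "Groceries", "Vegetables": "Groceries"}


def standardise():
    """Convert each source into rows of (month, bucket, amount)."""
    rows = []
    for r in bank_rows:
        month = datetime.strptime(r["date"], "%Y-%m-%d").strftime("%Y-%m")
        rows.append((month, BUCKETS[r["narration"]], float(r["debit"])))
    for r in wallet_rows:
        month = datetime.strptime(r["paid_on"], "%d/%m/%Y").strftime("%Y-%m")
        amount = float(r["amount"].replace(",", ""))  # drop thousands separator
        rows.append((month, BUCKETS[r["to"]], amount))
    for month, item, amount in cash_notes:
        rows.append((month, BUCKETS[item], float(amount)))
    return rows


def to_csv(rows):
    """Write the standard rows as CSV text, the stand-in for the single sheet."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["month", "bucket", "amount"])
    writer.writerows(rows)
    return buffer.getvalue()


standard_rows = standardise()
print(to_csv(standard_rows))
```

Once every source is reduced to the same three columns, the later steps no longer need to know where a record came from.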
3. Process Data
Latha and Shyam sit down to fill in the Excel sheet. As the volume of data for a single household is far smaller than that of a business or organisation, there is no need for a programming interface here.
They prepare a sheet with the months as columns and each expense bucket listed above as a row.
While filling in the sheet, they find that not all the data is available or accurate. For some expenses there are no bills, so they enter a standardised amount. Some bills are lost or misplaced, so there again they standardise and enter an average value, thereby filling in the missing values. This cleaning of the collected data is also called Data Wrangling or Munging.
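Filling a missing bill with an average, as Latha and Shyam do, might look like this in Python. This is a minimal sketch: the electricity figures are hypothetical, `None` is assumed to mark a lost bill, and the mean of the known months is just one of several reasonable fill strategies.

```python
from statistics import mean

# Hypothetical electricity bills; None marks a lost or missing bill.
electricity = {"Jan": 1200, "Feb": None, "Mar": 1100,
               "Apr": 1300, "May": None, "Jun": 1400}


def fill_missing_with_mean(monthly):
    """Replace missing entries with the mean of the known entries."""
    known = [v for v in monthly.values() if v is not None]
    if not known:
        raise ValueError("no known values to average")
    average = mean(known)
    return {m: (average if v is None else v) for m, v in monthly.items()}


filled = fill_missing_with_mean(electricity)
print(filled)
```

Known values are left untouched; only the gaps receive the average.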
How do companies process data? With larger volumes of data, this step requires programming skills to handle different data formats (CSV, JSON, XML, images, videos or SQL data) and to convert between them (for example CSV <-> JSON <-> XML), as well as sophisticated infrastructure like high-end servers and GPUs to run the code on.
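As a small taste of such format conversion, here is a sketch that turns CSV text into JSON using only Python's standard library. The column names and values are hypothetical, and a real pipeline would read from files or a database rather than from a string.

```python
import csv
import io
import json

# Hypothetical CSV export; the column names and values are invented.
csv_text = "month,bucket,amount\nJan,Fuel,2000\nJan,Medical,750\n"


def csv_to_json(text):
    """Parse CSV text into a list of records and serialise it as JSON."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        row["amount"] = float(row["amount"])  # CSV values arrive as strings
        records.append(row)
    return json.dumps(records, indent=2)


json_text = csv_to_json(csv_text)
print(json_text)
```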
4. Describe data
Once they have filled in all the buckets across the months, Latha and Shyam decide to visualize the data for better interpretation and decision making. They use Excel's built-in charts for this purpose: a line graph resembling a frequency polygon, and a pivot chart resembling a grouped frequency bar chart that shows the month-wise expense under each bucket. This turns out to be very helpful to them in drawing some inferences.
How to visualize high volume data? Visualizing high-volume data for industry purposes again requires sophisticated programming, often involving statistical concepts, with Python libraries like Matplotlib, NumPy and SciPy, or tools like R or Tableau, plus the hardware infrastructure to run them on.
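The idea behind the pivot chart can be sketched without any plotting library. The example below, with entirely hypothetical amounts, sums expenses into a bucket-by-month table and prints a crude text bar for each cell. In practice a library such as Matplotlib would draw the actual chart; the `text_bar` helper is only a stand-in.

```python
from collections import defaultdict

# Hypothetical long-format expenses: (month, bucket, amount). All values invented.
expenses = [
    ("Jan", "Groceries", 6000), ("Jan", "Outings", 1500),
    ("Feb", "Groceries", 7200), ("Feb", "Outings", 6500),
    ("Mar", "Groceries", 5800), ("Mar", "Outings", 900),
]


def pivot(rows):
    """Sum amounts into a {bucket: {month: total}} table, like a pivot table."""
    table = defaultdict(lambda: defaultdict(float))
    for month, bucket, amount in rows:
        table[bucket][month] += amount
    return {bucket: dict(months) for bucket, months in table.items()}


def text_bar(value, scale=500):
    """Draw a crude horizontal bar, one '#' per `scale` units of currency."""
    return "#" * round(value / scale)


table = pivot(expenses)
for bucket, months in table.items():
    print(bucket)
    for month, total in months.items():
        print(f"  {month} {total:>8.0f} {text_bar(total)}")
```

Even in this rough form, a spike in one bucket for one month stands out at a glance, which is the point of the visualization step.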
5. Model Data
The graphical visualization helped them arrive at some critical hypotheses based on the underlying distribution of expenses:
The bursts of expenses under the Outings bucket cause fluctuations in their savings.
The rent increase from 2020 was another major cause.
Their ad hoc grocery purchases and shopping also added an element of unpredictability to the expenses.
Other expenses like the vehicle loan, electricity bill (EB), cellphone and DTH bills were constant and hence helped predictability.
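Hypotheses like these can be checked with a simple measure of how much each bucket fluctuates. The sketch below ranks buckets by their coefficient of variation (standard deviation divided by mean). The figures are hypothetical, and this particular measure is an assumption chosen for illustration, not something the family in the example computed.

```python
from statistics import mean, pstdev

# Hypothetical six-month spend per bucket; all figures invented.
spend = {
    "Car loan": [9000, 9000, 9000, 9000, 9000, 9000],
    "Groceries": [6000, 7200, 5800, 8100, 6400, 7500],
    "Outings": [1500, 6500, 900, 7000, 1200, 5000],
}


def variability(values):
    """Coefficient of variation: standard deviation relative to the mean."""
    return pstdev(values) / mean(values)


# Most unpredictable bucket first.
ranked = sorted(spend, key=lambda b: variability(spend[b]), reverse=True)
for bucket in ranked:
    print(f"{bucket:<10} {variability(spend[bucket]):.2f}")
```

A constant expense such as a loan instalment scores zero, while a bucket with occasional bursts scores highest, matching the intuition drawn from the charts.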
Hence, they concluded that their expenditure pattern had to be made more predictable. To that end, they made the following decisions:
Instead of meeting a sudden burst of outing expenses from that month's salary, they will plan the number of outings and their cost ahead of time and invest in a monthly Recurring Deposit (RD). This not only makes the monthly outgoings predictable but also parks money in an RD rather than only in the salary account.
They decided to list their grocery requirements as completely as possible in one go and purchase them at the beginning of every month rather than shopping for groceries ad hoc. This makes the expense more standardised and predictable, thereby reducing impulsive spending.
They decided to offset the increase in rent by cutting back in other possible areas like outings or shopping. They also decided to plan a recurring deposit or advance saving to cover the shortfall created by medical expenses and the annual increase in rent.
Thus the various phases of data science helped this family make informed decisions, reduce unpredictable expenses and increase monthly savings.
Modelling at industry scale: The same approach can be translated to various domains like business, healthcare, education etc. Here the datasets will be of high volume and density, as with Big Data, and hence require sophisticated processing using languages like Python, stacks like Hadoop, SQL and NoSQL databases, and data lakes or warehouses, along with simple modelling techniques like the normal distribution and descriptive statistics or complex techniques like algorithmic modelling. The outcomes could further be fed to complex machine learning and deep learning algorithms, for example in image recognition, speech recognition, consolidating ratings for various commodities etc.
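To make the "simple modelling" end of that spectrum concrete, here is a sketch that fits a normal distribution to a handful of monthly figures and estimates how often a budget would be exceeded. The spend values and the budget are hypothetical, and treating six data points as normally distributed is a strong simplifying assumption made only for illustration.

```python
from statistics import NormalDist

# Hypothetical monthly grocery spend; all figures invented.
groceries = [6000, 7200, 5800, 8100, 6400, 7500]

# Fit a normal distribution using the sample mean and standard deviation.
model = NormalDist.from_samples(groceries)

budget = 8000  # hypothetical monthly budget
# Probability that a month exceeds the budget, if spend were truly normal.
p_over = 1 - model.cdf(budget)

print(f"mean={model.mean:.0f} stdev={model.stdev:.0f} "
      f"P(spend > {budget})={p_over:.2f}")
```

At industry scale the same idea applies to far larger datasets, where the choice of distribution would be validated against the data rather than assumed.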