In the previous article, you got an overview of data science. Following a well-defined workflow helps you reach the same results in much less time. This article breaks that workflow down into a set of steps.
What is a Data Science Workflow?
A workflow is a set of defined steps to achieve a goal. Most real-world problems are complex and difficult to track. A workflow helps break the problem into smaller chunks and makes the progress of the overall project measurable.
There is no single standard workflow design; each industry customises the data science process to its own needs. However, the CRISP-DM (Cross-Industry Standard Process for Data Mining) model is quite popular.
(image source: wiki)
There are six major steps in this workflow:
Business Understanding
The idea here is to understand the domain or business. Data science is not just about gathering data and publishing results.
Ask yourself why you are doing the analysis.
Who is going to benefit from this?
What would improve after you hand over this analysis to the stakeholders?
What are the important things to consider?
In a nutshell, know your audience first.
Data Understanding
Once the objective is clear, the next part is to collect data and check for data quality.
Data collection can be a lengthy process as there can be multiple data sources.
Can you consolidate everything into a single source of data?
Check for data types, descriptions, and missing values.
Perform a basic examination of data quality, and understand why each check matters.
How can you improve data quality?
Business Understanding and Data Understanding generally happen together, as one might need to go back to the business side to understand the data, and vice versa.
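The checks above can be sketched with pandas. This is a minimal illustration on a small, made-up DataFrame (the columns and values are hypothetical, standing in for whatever data you collect):

```python
import numpy as np
import pandas as pd

# Toy dataset standing in for collected data (hypothetical columns).
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41],
    "city": ["Pune", "Delhi", "Delhi", None],
    "salary": [50000, 64000, 58000, 72000],
})

print(df.dtypes)        # check data types of each column
print(df.describe())    # basic summary statistics
print(df.isna().sum())  # count of missing values per column
```

These three calls answer the first questions about a new dataset: what types you have, what the value ranges look like, and where data is missing.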
Data Preparation
After data collection, one needs to combine data from different sources into a single source or file, clean the data and perform the necessary data transformation. Generally, it's the most time-consuming of all the steps.
Does the single source have all information?
How do you handle missing data?
Do you have redundant information? Check the correlation matrix.
Clean the data and check for outliers.
Convert columns to appropriate data types.
Create new features out of existing ones.
One-hot encode categorical data.
Scale features so that they carry equal weight.
Good data preparation can enhance the performance of a project significantly.
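A few of these preparation steps can be sketched with pandas. The DataFrame below is a made-up example; the imputation and scaling choices (median fill, min-max scaling) are common defaults, not the only options:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a missing value and a categorical column.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41],
    "city": ["Pune", "Delhi", "Delhi", "Pune"],
    "salary": [50000, 64000, 58000, 72000],
})

# Handle missing data: impute the numeric column with its median.
df["age"] = df["age"].fillna(df["age"].median())

# One-hot encode the categorical column.
df = pd.get_dummies(df, columns=["city"])

# Min-max scale numeric features so they carry equal weight.
for col in ["age", "salary"]:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

print(df)
```

In a real project you would also inspect `df.corr()` for redundant columns and examine outliers before scaling, since min-max scaling is sensitive to extreme values.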
Modeling
Modeling is the phase where one generally uses standard algorithms to train on the prepared data. Machine learning algorithms are applied to datasets to learn patterns and generate rules. Models are the output of these algorithms.
If the data is not well prepared, the models can be biased and that bias will show up in the predictions, so modeling and data preparation depend on each other.
Choose a set of algorithms for modeling.
Perform cross-validation to check the overall performance of these algorithms.
Do you observe some algorithms perform better than others?
Fine-tune your model by changing the parameters of these algorithms.
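The steps above can be sketched with scikit-learn. Here the built-in iris dataset stands in for your prepared data, and the two algorithms are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate algorithms chosen for illustration.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    # 5-fold cross-validation; the mean accuracy summarises each algorithm.
    scores[name] = cross_val_score(model, X, y, cv=5).mean()

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Comparing the mean scores shows whether some algorithms perform better than others; the better ones are the candidates for fine-tuning their parameters.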
Evaluation
Modeling without evaluation does not make sense; once you shortlist the better-performing algorithms, you should test the output of these models.
Use a train-test split, e.g. train on 80% of the dataset and test on the remaining 20%.
The 80-20 split is a convention, not a rule; 70-30 or other ratios work too.
Choose evaluation metrics.
The models should have a justified error level.
Does the analysis help business reach its goals?
If yes, can that be measured and improved by visiting the modeling/data preparation phase?
Do you have solid conclusions from the analysis?
Can that have an impact?
Answers to these will help you understand how good your evaluation is. If you are satisfied, the final stage is deployment.
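A minimal train-test split and evaluation, again sketched with scikit-learn on the iris dataset (accuracy is used here as the metric, but the right choice depends on the business goal):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 80/20 split; the ratio is a convention, not a rule.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train on 80% of the data, evaluate on the held-out 20%.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {acc:.3f}")
```

The held-out score is the one to compare against the error level the business can justify.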
Deployment
Once you deploy your data science model, business stakeholders can test and see the performance.
Did you create documentation for deployment and testing?
Prepare a guidebook for non-technical users on how to use it.
Cloud deployments are pretty standard these days. (More about the cloud)
Real-time ML model predictions enhance usability.
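At its simplest, deployment means serialising the trained model and loading it wherever predictions are served. The sketch below uses `pickle` and a model trained on iris; the file name `model.pkl` is arbitrary, and a real deployment would typically wrap the loading-and-predicting part in an HTTP API or a cloud endpoint:

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a model (stand-in for your evaluated, shortlisted model).
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialise the trained model; a deployed service ships this artifact.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# At serving time: load the artifact once, then answer prediction requests.
with open("model.pkl", "rb") as f:
    served_model = pickle.load(f)

sample = [[5.1, 3.5, 1.4, 0.2]]  # one incoming request payload
print(served_model.predict(sample))
```

Note that unpickling is only safe for artifacts you created yourself; formats like ONNX or joblib files are common alternatives in production pipelines.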
How much time do I need to spend?
It depends entirely on the project. At the same time, the proportion of effort per step changes with the project's exposure and scope. Most real projects place a lot of emphasis on data collection and preparation, whereas learning projects often come with already-prepared data.
For real projects, the time break-up below is a close approximation:
What next?
You now have a brief idea of the steps needed to complete a data science project. A possible next step is to select a dataset from your topics of interest and apply what you have learned.