Data Science Project Steps and Processes

Apr 01, 2021

In this video we will investigate more about the data science project lifecycle and stages. Learn the steps and processes essential to any data science project.

In this lesson, we will look more closely at the data science project lifecycle and its stages. On this slide you can see the stages of any data science project: first the business objective, second data preparation, third descriptive analytics, fourth predictive analytics, and finally model validation and implementation.

Now, we will go through each part in more detail. Let's start with the first phase, the business objective. This is the very first phase of any data science project. In this phase, we need to specify the aim of our project from a business perspective and clearly define the deliverables.

For example, we can initiate a project to maximize our revenue, to extend our customer base, or maybe simply to find good recommendations for our customers. This phase involves three main sub-phases: first, define the business objective; second, define the scope; and finally, specify the analysis approach.

Second in line is data preparation. In this phase, we focus on the data that will be used to meet the business objective and work on extracting it from its source. Keep in mind that the required data may be in data warehouses, clusters, data lakes, databases, or even on the web. Along with sourcing the data, this phase requires you to clean and validate it.
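To make the cleaning and validation part a bit more concrete, here is a minimal Python sketch. It is only an illustration: the records, the field names (customer_id, age, monthly_spend), and the validation rules are all made up for this example, and a real project would pick its own rules based on the business objective.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical raw records, as they might arrive from a source system.
RAW_ROWS = [
    {"customer_id": "C001", "age": "34", "monthly_spend": "120.50"},
    {"customer_id": "C002", "age": "", "monthly_spend": "80.00"},     # missing age
    {"customer_id": "C003", "age": "-5", "monthly_spend": "60.25"},   # impossible age
    {"customer_id": "C001", "age": "34", "monthly_spend": "120.50"},  # duplicate
    {"customer_id": "C004", "age": "51", "monthly_spend": "abc"},     # not a number
    {"customer_id": "C005", "age": "28", "monthly_spend": "45.00"},
]


def parse_row(row: Dict[str, str]) -> Optional[dict]:
    """Return a typed copy of the row, or None if it fails validation."""
    try:
        age = int(row["age"])
        spend = float(row["monthly_spend"])
    except ValueError:
        return None  # missing or non-numeric value
    # Assumed plausibility rules for this example only.
    if not (0 < age < 120) or spend < 0:
        return None
    return {"customer_id": row["customer_id"], "age": age, "monthly_spend": spend}


def clean(rows: List[Dict[str, str]]) -> Tuple[List[dict], int]:
    """Validate rows and drop repeated customer_ids.

    Returns the cleaned rows and the number of rejected rows.
    """
    seen = set()
    cleaned = []
    rejected = 0
    for row in rows:
        parsed = parse_row(row)
        if parsed is None or parsed["customer_id"] in seen:
            rejected += 1
            continue
        seen.add(parsed["customer_id"])
        cleaned.append(parsed)
    return cleaned, rejected


cleaned_rows, rejected_count = clean(RAW_ROWS)
```

Counting the rejected rows, rather than silently dropping them, is a deliberate choice here: a high rejection count is often the first sign that we need to go back to the data source.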

The data preparation phase therefore involves four main sub-phases: data collection, data exploration, data validation, and data cleaning.

Third is descriptive analytics. In this phase we are interested in getting to know our data: the meaning and effect of each data point, usually referred to as a feature, and how the features interact with each other. This process is often the key to the success of any data science project, because the more you know about your data, the more you will realize its capabilities and how to get the most out of it.
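As a small illustration of getting to know the data, the sketch below summarizes each feature on its own and then measures how two features move together. The feature names and values are invented for this example, and the Pearson correlation is just one of several possible ways to look at the interaction between two features.

```python
import math
import statistics

# Hypothetical features for a handful of customers.
age = [23, 31, 38, 45, 52, 60]
monthly_spend = [40.0, 55.0, 62.0, 80.0, 85.0, 98.0]


def describe(values):
    """Summarize a single feature (univariate view)."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def pearson(xs, ys):
    """Pearson correlation between two features (bivariate view)."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


age_summary = describe(age)
age_vs_spend = pearson(age, monthly_spend)
```

A correlation close to 1 or -1 suggests the two features carry overlapping information, which is useful to know before any modeling starts.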

This process may involve some or all of the following sub-phases: descriptive statistics, univariate and multivariate analysis, visualizations and insights, and creating new expressive features.

Fourth is predictive analytics. Once we're pretty confident with our scope, objective, and data, we can move to the next phase, which is building predictive models using data mining, machine learning, or deep learning approaches, which I usually refer to simply as models. The main purpose of this phase is to build prediction models that can help us predict future data points by taking historical data into consideration.
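To show what predicting a future data point from historical data can look like in its simplest form, here is a minimal sketch that fits a straight line by ordinary least squares and extrapolates one step ahead. The monthly revenue numbers are made up, and a straight line is only a stand-in for whatever modeling technique a real project would choose.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical historical data: month index and revenue.
months = [1, 2, 3, 4, 5, 6]
revenue = [102.0, 109.0, 121.0, 130.0, 138.0, 151.0]

model = fit_line(months, revenue)
next_month_forecast = predict(model, 7)  # forecast for the month after the history ends
```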

The predictive analytics phase may involve some or all of the following sub-phases: modeling technique identification, building predictive models, model iteration and best fit, and model interpretation.

The fifth phase is model validation and implementation. After we get a model in the previous phase, we try to validate that our model is working fine and will produce the desired results when used in production, meaning on real, live business data.
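One way to check a model before it goes live is to score it on data it was not trained on, both from the same time window as the training data and from a later window (often called in-time and out-of-time validation). The sketch below is only an illustration under assumptions: the row layout (period, x, y), the cutoff, the every-third-row holdout, and the use of mean absolute error are all choices made for this example.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def split_by_time(rows, cutoff):
    """Rows up to the cutoff period are in-time; later rows are out-of-time."""
    in_time = [r for r in rows if r["period"] <= cutoff]
    out_of_time = [r for r in rows if r["period"] > cutoff]
    return in_time, out_of_time


def holdout_split(rows, every=3):
    """Hold out every `every`-th in-time row; train on the rest."""
    train = [r for i, r in enumerate(rows) if (i + 1) % every != 0]
    holdout = [r for i, r in enumerate(rows) if (i + 1) % every == 0]
    return train, holdout


def evaluate(model, rows):
    """Score a model (any callable x -> prediction) on a set of rows."""
    actual = [r["y"] for r in rows]
    predicted = [model(r["x"]) for r in rows]
    return mean_absolute_error(actual, predicted)


# Hypothetical data: eight periods, one observation each.
SAMPLE_ROWS = [
    {"period": p, "x": float(p), "y": y}
    for p, y in enumerate([3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.4, 16.7], start=1)
]


def example_model(x):
    """Stand-in for a model that was already fitted on the training rows."""
    return 2.0 * x + 1.0


in_time_rows, out_of_time_rows = split_by_time(SAMPLE_ROWS, cutoff=6)
train_rows, holdout_rows = holdout_split(in_time_rows)
in_time_error = evaluate(example_model, holdout_rows)
out_of_time_error = evaluate(example_model, out_of_time_rows)
```

If the out-of-time error is much worse than the in-time error, that is usually a hint that the model needs to be recalibrated before deployment.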

The validation step is crucial when deciding the success criteria of our data science project. It may involve some or all of the following sub-phases: in-time validation, out-of-time validation, and model recalibration.

Along with validation we have implementation. When we reach this phase, we are very confident that our model performs as desired for our objective and scope, meaning we can now make our models live and let the business consume them for further actions. This process involves the following sub-phases: model deployment and model documentation.

Do you think the traditional waterfall model works for data science? The answer is no, and let's see why.

A very important point to clarify is that a data science project is a cyclic process, which means we can go back and forth between any two consecutive phases to reach the best final performance and outcome for the business. For example, in the data preparation step, we may realize that the existing data does not fit our business objective or is not giving us the desired model accuracy. So, we might spend more time getting more data from the source, or maybe cleansing or preprocessing it further.

That's all for this lesson. Thank you.