Common Industrial and Commercial Data Sources

May 01, 2021

In this video, we'll look into the different kinds of data sources that can be leveraged in data science projects. You will learn to recognize the primary industrial and commercial data sources around us and use this knowledge to select a suitable data source for your business processes.

Let's start with file-based systems, which are one kind of data source. File-based systems are the most elementary and traditional method for storing data. These systems store data within the application itself, without relying on any external application to store it.

Great examples of such systems are Microsoft Word, Excel, and PowerPoint, which store the data you create in the application itself. You'll notice that you don't use any other application to store your work. These systems are very effective for small businesses or personal work, but have a lot of disadvantages for commercial or large-business use cases. Let's look at some of the disadvantages. The first is program-data dependence.

Suppose that in Company X, the sales department stores its data in Microsoft Excel files while, on the other hand, the HR department uses plain text files. In this case, the files from the sales department cannot be integrated directly with the HR department's files. Also, if the HR department does not have Microsoft Excel installed, it will not be able to open any files from the sales department. This means that the data is highly dependent on the program used to store it.

The second disadvantage is separation and isolation of data. Since every program uses a different back end to store its data, data becomes isolated from one application to another, creating inconsistencies and sync issues between data sets. In the same example we saw previously, the sales and HR department data will not be merged and synced automatically. The last disadvantage is duplication of data. Program-specific data dependence, which isolates and separates data across the company, also leads to redundant storage of data.

Many teams and applications might be storing the same information again and again, wasting space and making it a challenge to maintain the up-to-date version. Now, let's look into the alternative to file-based systems, which is the database. A database is a collection of data stored in an organized manner for ease of accessing, storing, and managing. Databases are widely popular and have eliminated the need for file-based systems altogether.

Now a single central database can store all the data. Let's take a quick use case to understand the benefits of databases. As discussed, the idea of a database is to centralize the storing of data so that access and management become easier. For that purpose, a company can create two different tables inside a database to maintain two different kinds of data that are inherently linked through a certain value.

For example, there's an Employees table where all the personal details are saved, and another table where the employee's department details are saved. As you can see, both tables are linked through the department ID, or Dep_id. This way, it's easier to maintain records for the respective departments: the HR department will make sure to keep personal information updated, and the employees' teams will maintain their department-specific details. One of the important components of a database is its schema.

You can think of a schema as a blueprint of the data that needs to be stored. The schema enforces certain rules for storing and managing the database: for example, you can't put a number in the name field, you can't leave certain attributes empty, you can't have multiple employees with the same ID, and many more rules like this. [Video description begins] The Department table lists each column with its type: Dep_id INTEGER, Dep_name VARCHAR(20), Dep_location VARCHAR(15). The Employees table lists: Emp_id INTEGER, Emp_name VARCHAR2(15), Job_name VARCHAR2(10), Manager_id INTEGER, Hire_date DATE, Salary DECIMAL(10,2), Commission DECIMAL(7,2), Dep_id INTEGER. [Video description ends] Now let's explore Data Warehousing.
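Before we do, here is a minimal sketch of what those two linked tables can look like in practice, using Python's built-in sqlite3 module. The table and column definitions follow the on-screen example; the department and employee rows inserted here are made up purely for illustration.

```python
import sqlite3

# In-memory database; a real system would use a persistent file or a database server.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The schema acts as the blueprint: types, NOT NULL, and PRIMARY KEY constraints
# enforce the rules described above (no empty attributes, no duplicate IDs).
cur.executescript("""
CREATE TABLE Department (
    Dep_id       INTEGER PRIMARY KEY,
    Dep_name     VARCHAR(20) NOT NULL,
    Dep_location VARCHAR(15)
);

CREATE TABLE Employees (
    Emp_id     INTEGER PRIMARY KEY,
    Emp_name   VARCHAR(15) NOT NULL,
    Job_name   VARCHAR(10),
    Manager_id INTEGER,
    Hire_date  DATE,
    Salary     DECIMAL(10,2),
    Commission DECIMAL(7,2),
    Dep_id     INTEGER REFERENCES Department(Dep_id)
);
""")

# A couple of made-up rows purely for illustration.
cur.execute("INSERT INTO Department VALUES (10, 'HR', 'New York')")
cur.execute(
    "INSERT INTO Employees VALUES (1, 'Alice', 'Analyst', NULL, '2020-06-01', 55000.00, NULL, 10)"
)

# The two tables are linked through Dep_id, so they can be joined on demand.
cur.execute("""
    SELECT e.Emp_name, d.Dep_name, d.Dep_location
    FROM Employees e
    JOIN Department d ON e.Dep_id = d.Dep_id
""")
print(cur.fetchall())  # [('Alice', 'HR', 'New York')]
conn.close()
```

If you tried to insert a second employee with Emp_id 1, or an employee with no name, the schema would reject it, which is exactly the kind of rule enforcement described above.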

Let's start with the basic definition of Data Warehousing: simply, the practice of consolidating multiple databases, both online and offline, and from any source, into a central location, such that they can be used for downstream tasks without the worry of accessing multiple sources. Data warehousing is not a new concept and has been around for a long time.

But in recent times, businesses have started collecting great amounts of data, estimated at millions of terabytes, and Data Warehousing has become very mainstream since, more than ever, businesses need tighter integration and centralization of different sources. Another very important point to note here is that big businesses usually don't have just a single source of data. They usually have multiple sources and databases, depending upon their requirements.

This primarily leads to fragmentation, which is then solved through Data Warehousing. After centralizing and indexing data from different sources, let's say, for a big enterprise, customer relationship data, billing data, ERP systems, or any other source, it can be leveraged for centralized reporting, analytics, and mining of any crucial pattern. The process of moving the data from those sources into the central store is typically referred to as ETL, or extract, transform, and load.
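To make that concrete, here is a minimal ETL sketch in Python using pandas and sqlite3. The source file names, their columns, and the local SQLite file standing in for the central warehouse are all assumptions made purely for illustration; a real pipeline would pull from live CRM, billing, and ERP systems and load into an actual warehouse.

```python
import sqlite3
import pandas as pd

# --- Extract: read data from two separate (hypothetical) source files. ---
customers = pd.read_csv("customers.csv")   # e.g. customer_id, name, region
billing = pd.read_csv("billing.csv")       # e.g. customer_id, amount, billed_ts

# --- Transform: merge the sources and derive more readable columns. ---
merged = billing.merge(customers, on="customer_id", how="left")
# Convert a raw Unix timestamp into a normal date and time.
merged["billed_at"] = pd.to_datetime(merged["billed_ts"], unit="s")
merged = merged.drop(columns=["billed_ts"])

# --- Load: write the cleaned, combined table into the central store. ---
warehouse = sqlite3.connect("warehouse.db")
merged.to_sql("customer_billing", warehouse, if_exists="replace", index=False)
warehouse.close()
```

Each phase here is only a few lines, but in real pipelines the transform step is where most of the work, and most of the business logic, tends to live.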

During the ETL process, data from different sources goes through the necessary transformations, for example, merging billing and customer data, or creating new columns in a more readable format, like converting timestamps to normal dates and times. Finally, once all required transformations are done, the load phase loads the data into the centralized database, ready for any further usage.

Now, let's explore another element of data, which is Big Data. In spite of the massive advancements in storing data in databases and data warehouses, and their capabilities in storing and processing large amounts of data, there was something missing. All of these sources could handle structured data only, meaning data that can be represented in tabular format. The missing element here is unstructured data. Unstructured data refers to data such as images, videos, text, and audio. These types of data are not easily saved or represented in tabular format, and hence there's a need for a different kind of database that can support them. Also, unstructured data is usually very big in size. Imagine how many videos, images, and audio files companies like Facebook and Instagram have to save every day. Hence, we usually refer to it as Big Data. Do note that sometimes large structured data can also be part of Big Data, but loosely we refer to Big Data in the context of unstructured data.

To process huge amounts of Big Data, we need equally powerful processing systems. For such cases, we have parallel processing clusters that can do the ETL processing in a distributed manner, meaning they can do all the required work by parallelizing processing across multiple machines rather than using a single machine. These are usually referred to as Big Data clusters. One of the great examples of a Big Data processing system is Hadoop, which is widely used across industries. One of the important parts of Hadoop is the Hadoop Distributed File System, which can take any Big Data and break it down into smaller, manageable files that can be stored across multiple machines, commonly referred to as a cluster. This breakdown is important, since any required processing can then be executed in parallel across the cluster, giving us faster results. Do note that this is required because we are dealing with very large amounts of data, and it's nearly impossible to process them on a single machine.
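To get an intuition for that split-and-process-in-parallel idea, here is a toy, single-machine analogy in Python using the multiprocessing module. This is not Hadoop: HDFS spreads blocks across many machines and a framework such as MapReduce schedules the work there, but the shape of the computation, splitting the data into chunks, processing the chunks in parallel, and combining the partial results, is the same.

```python
from multiprocessing import Pool

def process_chunk(chunk):
    # Each worker computes a partial result on its own chunk (the "map" step).
    return sum(chunk)

if __name__ == "__main__":
    # A stand-in for a very large dataset, split into smaller, manageable pieces,
    # much like HDFS splits a big file into blocks spread across machines.
    data = list(range(1_000_000))
    chunk_size = 100_000
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    # Process every chunk in parallel across worker processes, then combine
    # the partial results (the "reduce" step).
    with Pool() as pool:
        partial_sums = pool.map(process_chunk, chunks)
    print(sum(partial_sums))  # 499999500000
```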