How Do Big Companies Manage Their Huge Data?
Hey guys, hope you are doing well amidst all that is going on around the world.
So, as you have read from the title, today we are going to see how big companies and MNCs like Google, Yahoo, Instagram, etc. manage the huge amounts of data that they receive every single day. This post won't go much into the technical part.
We are going to see how they store the data, how they manipulate the loads of data they receive, how they manage to make it all work so speedily, what technologies these companies use, and how efficiently they work.
The digital era has created an overwhelming amount of information, with the total amount of data projected to rise to 44 zettabytes by the end of 2020. This massive amount of data has proven to be immensely valuable to large enterprise companies.
WHAT is DATA ??
So, the most important question we face: 'What is Data?'
Data, as you may know, is just raw facts and figures; a collection of data is information. This information is being created every second of every day!! From MNCs to even small startups, everyone generates huge amounts of data. Any business that has a website or a social media presence, or accepts electronic payments of any kind, is collecting data about customers, user habits, web traffic, demographics, and more.
All that data is filled with potential if you can learn to get at it. That is the reason why all these companies keep our data: they store it, process it, and analyze it. In this way they research and draw conclusions about how to market their products more efficiently, just from our data.
So, this is where things get interesting. Many multinational companies create and receive large amounts of data every day. There is so much of it that it becomes hard to manage, and even hard to transfer, as it runs into petabytes and maybe more.
To manage this data, the concept of Big Data was introduced. There's nothing new about the notion of big data, which has been around since at least 2001. In a nutshell, Big Data is your data: the information owned by your company, obtained and processed through new techniques to produce value in the best way possible.
WHAT is BIG DATA ??
Over the past few years, you must have heard the term “Big Data” which is defined in different ways.
Big Data describes large volumes of data, both structured and unstructured. The data belongs to different organizations, and each organization uses it for different purposes.
Giant companies like Amazon and Walmart, as well as bodies such as the U.S. government and NASA, are using Big Data to meet their business and/or strategic objectives. Big Data can also play a role for small and medium-sized companies and organizations that recognize the possibilities and capitalize on them.
Big Data is a data set so huge and complex that traditional data processing applications are inadequate to deal with it. Managing such a huge volume of data brings challenges such as capture, storage, analysis, transfer, sharing, etc.
With the advent and increased use of the internet, social media has become an integral part of people’s daily routine. Social media is not only used to connect with others, but it has become an effective platform for businesses to reach their target audience. With the emergence of big data, social media marketing has reached an altogether new level. All the posts, blogs, photos and videos posted by users on their social network contain useful information about their demographics, likes, dislikes, etc. Businesses are utilizing this information in numerous ways, managing and analyzing it to get a competitive edge.
Every other company creates big data every day. The problem arises when it comes to managing it. As users, we want technology to respond super fast.
Suppose you are watching a video on YouTube: if the video lags, you don't like it, because you don't want to wait even a second. Or suppose you want to upload your photos and videos to the cloud, but the data is so large that it takes hours and hours and still won't finish uploading!! Wouldn't that annoy anyone?
To solve various problems like these, companies take numerous measures to match our needs. Hence, big data solutions have emerged to help analyze and categorize information, as well as to predict market trends.
On our PCs and laptops we use local disks to store, edit, and, in short, manage our data. But a local disk is a physical drive limited to a certain capacity. Local disks are great for personal use, but for managing data in petabytes and exabytes, we use a distributed file system.
A distributed file system is a file system spread across multiple file servers or multiple locations. It allows programs to access or store remote files just as they do local ones, letting them work with files from any computer on the network.
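To make the idea concrete, here is a minimal toy sketch in plain Python (not a real distributed file system — the "servers" are just dictionaries I made up for illustration): a file's bytes are split into fixed-size blocks, the blocks are scattered across several servers, and a small index remembers where each block lives, so the client still sees one whole file.

```python
# Toy sketch of the distributed file system idea (illustration only).
BLOCK_SIZE = 4  # tiny for the demo; real systems use much larger blocks

servers = {0: {}, 1: {}, 2: {}}   # server_id -> {block_id: bytes}
index = {}                        # filename -> [(server_id, block_id), ...]

def dfs_write(name, data):
    """Split the file into blocks and scatter them round-robin."""
    index[name] = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        server_id = (i // BLOCK_SIZE) % len(servers)
        servers[server_id][(name, i)] = block
        index[name].append((server_id, (name, i)))

def dfs_read(name):
    """Reassemble the file by fetching each block from its server."""
    return b"".join(servers[s][b] for s, b in index[name])

dfs_write("hello.txt", b"hello distributed world")
print(dfs_read("hello.txt"))  # the client sees one file, not the blocks
```

The point is the separation of concerns: the index (metadata) is small, while the actual bytes can live on as many machines as needed.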
This is where I introduce you to HDFS!!
HDFS, or the Hadoop Distributed File System, is a distributed file system that handles large data sets running on commodity hardware. It is used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes. What is a node? Let me explain.
The Hadoop Distributed File System follows a master–slave architecture. The master acts as the NameNode (NN) and the workers act as DataNodes (DN). A small Hadoop cluster includes a single master and multiple worker nodes.
Each cluster comprises a single NameNode that acts as the master server: it manages the file system namespace and regulates clients' access to files. Each DataNode manages the storage attached to the node it runs on. The NameNode executes namespace operations like opening, closing, and renaming files and directories, and it also maps blocks to DataNodes.
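That split of roles can be sketched in a few lines of toy Python (hypothetical class names, nothing to do with Hadoop's real API): the NameNode holds only metadata — the namespace and the block-to-DataNode map — while the DataNodes hold the actual bytes.

```python
# Hedged sketch of the NameNode/DataNode roles (toy model, not Hadoop code).
class DataNode:
    """A worker node: stores raw block bytes, nothing else."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}            # block_id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]

class NameNode:
    """The master: holds metadata only (namespace + block map)."""
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.namespace = {}         # path -> [block_id, ...]
        self.block_map = {}         # block_id -> DataNode

    def create(self, path, blocks):
        self.namespace[path] = []
        for i, data in enumerate(blocks):
            block_id = f"{path}#blk{i}"
            dn = self.datanodes[i % len(self.datanodes)]  # naive placement
            dn.store(block_id, data)                      # bytes go to a worker
            self.block_map[block_id] = dn
            self.namespace[path].append(block_id)

    def open(self, path):
        # A client asks the master WHERE blocks are, then reads from workers.
        return b"".join(self.block_map[b].read(b) for b in self.namespace[path])

nn = NameNode([DataNode(i) for i in range(3)])
nn.create("/logs/app.log", [b"error at 10:01\n", b"retry at 10:02\n"])
print(nn.open("/logs/app.log"))
```

Notice that the master never stores file contents — that is what lets one NameNode coordinate thousands of DataNodes.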
Here, data is stored in multiple locations, and in the event of one storage location failing to provide the required data, the same data can be easily fetched from another location.
It is distributed across hundreds or even thousands of servers with each node storing a part of the file system.
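The failover behavior described above can also be sketched as toy code (again an illustration, assuming a replication factor of 3, which is HDFS's default): each block is copied to several nodes, and a read simply skips any node that is down.

```python
# Toy replication sketch: each block lives on several nodes, and a read
# falls back to another replica if a node has failed. Illustration only.
REPLICATION = 3

nodes = {i: {} for i in range(4)}    # node_id -> {block_id: bytes}
alive = {i: True for i in nodes}     # simulated node health
replicas = {}                        # block_id -> [node_id, ...]

def put_block(block_id, data):
    """Write the block to REPLICATION different nodes (naive placement)."""
    targets = list(nodes)[:REPLICATION]
    for n in targets:
        nodes[n][block_id] = data
    replicas[block_id] = targets

def get_block(block_id):
    """Read from the first healthy replica; fail only if ALL are lost."""
    for n in replicas[block_id]:
        if alive[n]:
            return nodes[n][block_id]
    raise IOError("all replicas lost")

put_block("blk_0001", b"important bytes")
alive[0] = False                     # simulate one storage node failing
print(get_block("blk_0001"))         # still served, from another replica
```

This is why a single disk or server failure in such a cluster goes unnoticed by users: the same data is simply fetched from another location.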
HDFS works exceptionally well for large datasets, where the standard size of a dataset can be anywhere from gigabytes to terabytes.
Thus, we see how HDFS handles the flood of incoming data: it is a platform built for high-throughput batch processing rather than interactive use. This lends itself to Big Data applications where data is coming in thick and fast and has to be continuously processed in order to make sense of it in real time or near real time.