Hadoop is open source software from the Apache Software Foundation that lets enterprises gain the advantages of distributed computing and scalable storage using entry-level servers. In this article, let us get to know more about Hadoop and analyze why customers would want to deploy it in their companies.
Hadoop consists of two main components: HDFS (Hadoop Distributed File System) for storage and MapReduce for distributed computing. Together, these components enable a company to buy entry-level servers and pool their storage and processing capacity into a distributed computing system.
HDFS enables companies to break files into smaller blocks (64 MB each, for instance) and store the individual blocks on multiple servers. By default, Hadoop stores each block on three servers. Though this replication increases the total disk space required to store the data, it gives a good amount of redundancy: data can be recovered and re-replicated automatically even if up to two of the disks or servers holding a block fail at the same time.
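The splitting and replication described above can be illustrated with a small toy model. This is only a sketch in plain Python, not the HDFS API: the function names, the node names, and the round-robin placement are assumptions made for illustration (real HDFS placement is rack-aware and more involved).

```python
# Toy model of HDFS-style block splitting and replication (not the real HDFS API).
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the example block size used in the text
REPLICATION = 3                 # number of copies kept of every block


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block for a file of file_size bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, servers, replication=REPLICATION):
    """Assign each block to `replication` distinct servers, round-robin.

    Round-robin is a simplification chosen for this sketch.
    """
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    return {
        block: [servers[(block + i) % len(servers)] for i in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed):
    """Blocks that still have at least one replica on a healthy server."""
    return [b for b, hosts in placement.items() if set(hosts) - set(failed)]


file_size = 200 * 1024 * 1024                      # a 200 MB file
blocks = split_into_blocks(file_size)              # 64 + 64 + 64 + 8 MB
servers = ["node1", "node2", "node3", "node4", "node5"]   # hypothetical names
placement = place_replicas(len(blocks), servers)
raw_storage = sum(blocks) * REPLICATION            # 600 MB of raw disk used
survivors = readable_blocks(placement, failed=["node2", "node3"])
print(len(blocks), raw_storage // (1024 * 1024), len(survivors))
```

In this sketch the 200 MB file becomes four blocks that occupy 600 MB of raw disk, and every block is still readable after two of the five servers fail, which is the trade-off between storage overhead and redundancy.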
Since HDFS creates a distributed file storage system that divides the original data into smaller blocks and stores them on separate servers, the MapReduce component of Hadoop takes advantage of this layout to provide distributed computing capability. A big file is split into blocks that are distributed across individual servers. Many such blocks are processed simultaneously using the computing power of the servers that hold them (the map step), and their outputs are then combined (the reduce step) to create the final output that is presented to the user.
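To make the flow concrete, here is a minimal word-count sketch of the map, shuffle, and reduce steps in plain Python. It does not use Hadoop itself; the function names and sample strings are assumptions for illustration, and the blocks are processed one after another here, whereas a real cluster would process them in parallel on different servers.

```python
from collections import defaultdict


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(mapped_outputs):
    """Shuffle: group all emitted values by key across every block."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values collected for each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


# Each string stands in for one block stored on a different server.
blocks = ["big data big cluster", "data node data block", "big block"]

mapped = [map_phase(b) for b in blocks]   # one map task per block
counts = reduce_phase(shuffle(mapped))
print(counts)
```

The point of the structure is that each map task only ever looks at its own block, so the tasks need no coordination until their outputs are grouped by key.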
The individual nodes used in Hadoop need to be computing devices (like servers) with adequate storage contained within them. Hadoop can even assign a task to a faster (idle) server while the same task is still being processed by a slower server, and use whichever copy finishes first, in order to get the job completed faster. Hadoop supports multiple types of hardware: any system, from ordinary computers to multi-core servers, can be used in any combination.
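The idea of running the same task on a second server and keeping the first result can be sketched with Python threads. This is a simplified illustration, not Hadoop's scheduler: the node names and delays are made up, both copies are started at once, and the slower copy is simply ignored, whereas Hadoop launches the extra copy only after it detects a slow task.

```python
import time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait


def run_task(node, delay, data):
    """Simulate one node processing a block; `delay` models the node's speed."""
    time.sleep(delay)
    return node, sum(data)


def run_speculatively(data, nodes):
    """Run the same task on every node and return the first result to arrive."""
    pool = ThreadPoolExecutor(max_workers=len(nodes))
    futures = [pool.submit(run_task, name, delay, data) for name, delay in nodes]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)   # do not block on the slower copy
    return next(iter(done)).result()


# Hypothetical nodes: (name, seconds needed to finish the task).
nodes = [("slow-node", 0.5), ("fast-node", 0.01)]
winner, result = run_speculatively([1, 2, 3, 4], nodes)
print(winner, result)
```

Because both copies compute the same result from the same data, it does not matter which one is kept; only the completion time changes.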
Of course, this type of processing is not suitable for certain applications, such as those that index data and query for a particular value using that index (as relational databases do). Hadoop is more suitable for applications where all the data in a large file needs to go through a similar type of processing (image, audio, or video processing applications, for example).
The advantages of Hadoop are:
- Redundancy for Storage (in the event of disk / server failures).
- Faster access to stored data for processing (as each block is usually processed on a server that already stores it).
- Distributed Computing using economical entry level servers enables faster processing, while keeping the costs low.
- Complex or loosely structured data (which database systems do not fully support) and large files can be processed quickly.
- Enormous scalability: Hadoop can manage anything from one or a few servers to a thousand or more.
- Cost savings on the software as Hadoop is an open source project that is free to download and use.
Its limitations include its unsuitability for certain types of applications (such as relational database applications), the extra raw storage capacity that replication requires, and a finite upper limit to storage expansion. Hadoop is also not meant for processing large numbers of small files or for low-latency data access.
In most cases, application developers need not deal with the details of distributed computing themselves, as Hadoop takes care of splitting the data, scheduling the work across servers, and recovering from failures. Applications do, however, need to be expressed as MapReduce jobs; once they are, they can run across multiple systems for storage efficiency and better compute performance.
Hadoop has gained a lot of traction in recent times, and many companies have implemented it successfully. There are also service providers who specialize in setting up Hadoop-based distributed computing systems for customers.