Big Data is a term that that became popular during the 1990s, back when data sets became too large to be processed using conventional software tools. Nowadays, Big Data is resembling the Universe, as its size keeps expanding and it will continue to do so.
One framework designed to handle such a huge amount of data is called Hadoop. It was initially released in December 2011 and its popularity has been rising ever since, as businesses are focusing on Big Data at a continually increasing rate.
What is Hadoop?
Apache Hadoop is an open-source software written in Java, a framework that allows for the distributed processing of very large data sets across clusters of computers. It is using simple programming models and it provides massive storage for any kind of data, from single servers to thousands of machines, while offering local computation for each one.
The concept for this open-source software started when search engines and indexes were created to help locate relevant information. In the beginning there was Nutch, an open-source web search engine developed by Doug Cutting and Mike Cafarella.
As the World Wide Web grew fast, automation was needed. After Cutting joined Yahoo! in 2006, the Nutch project was divided. The web crawler portion continued to exist as Nutch, while the distributed computing and processing portion became Hadoop.
It did not take too long to establish a dedicated and diverse community of committers and maintainers supporting and improving the software. Eventually, large web scale companies as Facebook, LinkedIn and Twitter started to integrate Hadoop into their data warehouse frameworks.
What makes it a good choice?
Hadoop has the ability to handle both structured data, like the one found in relational databases, and unstructured information.
Because it is an open-source software, Hadoop can be used with zero licensing and support costs by any organization. It can deliver data into computing infrastructures at a huge transfer rate, which can easily exceed 2Gbps per computer in the MapReduce layer.
Main features:
- Cost effective
- Customized dashboards
- YARN Capacity Scheduler
- Resilient to failure
Hadoop has proven its worth in thousands of different use cases and cluster sizes, from startups to Internet giants and governments, against a variety of full-scale production applications.
Another one of its advantages is that it lets you store as much data as you want in whatever form you need by adding more servers to a Hadoop cluster. Each new server brings in more storage, thus being far less costly compared to prior data storage options.
Hadoop has its own shortcomings as well
Overall, this platform is regarded as an efficient Big Data environment. On the other hand, there are limitations in the MapReduce framework of Hadoop that prohibit efficient operation, especially when you require inter-node communications.
Although Hadoop is quite popular for being free and fast, it is obvious that not every part of it is a plus. When it comes to sensitive data, the risk of accidentally losing personal information, due to Hadoop’s sluggish security capabilities, it is not considered worthwhile by some organizations.
Where to go from here?
As revealed by CIO Magazine, almost 50% of the businesses taking part in a survey have already implemented a Big Data environment, such as Hadoop, or are in the process to do so. Here are the numbers:
- 7% – have already implemented a Big Data environment;
- 19% – have already implemented and will continue to grow their Big Data environment;
- 22% – are in the process of implementing a Big Data environment;
- 18% – are planning to implement a Big Data environment within a year;
- 34% – are planning to implement a Big Data environment, although it will take 13 to 24 months to complete.
Hadoop still has a lot to achieve, albeit it is not a disappointment, nor outdated, yet. The future remains promising for this open-source software, as more and more companies are willing to implement the framework into their day-to-day activities.