Technology Diploma in Database Management with Hadoop
Industry analysts such as The Data Warehousing Institute (TDWI) have indicated that Hadoop usage will become near universal among mainstream companies in the next few years for a variety of essential data applications. A TDWI survey found that the use of Hadoop systems increased 60 percent in the last two years alone. Hadoop and its ecosystem of tools hold an advantage over traditional databases in their ability to analyze and manage enormous amounts of data, rapidly ingesting raw data without requiring that data to be extensively structured first.
The Hadoop ecosystem has matured to the point where it can analyze and process both structured and unstructured data from multiple sources. Several SQL tools have emerged that can query and analyze even unstructured data, a capability that was not previously available, making Hadoop critical to successful data warehousing, data scalability, data analysis, and the handling of multiple data types.
The Diploma in Database Management with Hadoop provides a comprehensive introduction to big data; the Hadoop framework; and the so-called Hadoop ecosystem of products used for querying, managing, analyzing, and scripting data, including Hive, Pig, Drill, Impala, and Spark. This immersive curriculum covers what big data is, introduces the Hadoop components, and describes examples of Hadoop's uses and its management. The Diploma also compares several Hadoop distributions, including the three market leaders: Cloudera, Hortonworks, and MapR.
As a student in this program, you will build a portfolio of assignments showcasing your understanding of, and ability to apply, the core concepts and competencies of the Diploma, including the tools of the Hadoop ecosystem. During the fourth course, you will also complete a final capstone project, working in a team of two to four to process and analyze data using the tools of the Hadoop ecosystem. Your team will present its project to the class and to invited industry professionals in a 15-minute recorded final presentation, followed by a five-minute Q&A that assesses your ability to defend your work as well as your formal presentation, public speaking, and communication skills. Once in the job market, you will be able to submit this work product to illustrate your knowledge of processing data with Hadoop. Along the way, you will gain:
- The ability to use Spark’s interactive shell to load and inspect data, and to build and launch a standalone Spark application
- The skills to perform SQL queries with Hive, Drill, Impala, and Spark SQL against both structured and unstructured data
- The ability to load data into a Pig relation and build a data flow using that data to illustrate the concepts of extracting, transforming, and loading (ETL) data
- The skills to start Flume and stream server log data into HDFS
- The ability to use Ambari to monitor and manage a Hadoop cluster, as well as to work with Microsoft’s HDInsight service on Azure
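The ETL data flow mentioned above is the heart of what Pig expresses. As a rough sketch of the idea, written in plain Python rather than Pig Latin so it is self-contained (the field names and sample records are invented for illustration), an extract-transform-load pass might look like:

```python
import csv
import io

# Extract: read raw CSV records (an in-memory file stands in for HDFS input).
raw = io.StringIO("user,amount\nalice,10\nbob,not_a_number\ncarol,25\n")
rows = list(csv.DictReader(raw))

# Transform: drop malformed records and cast the amount field to an integer,
# roughly what a Pig FILTER plus FOREACH ... GENERATE would express.
clean = []
for row in rows:
    try:
        clean.append({"user": row["user"], "amount": int(row["amount"])})
    except ValueError:
        continue  # discard records whose amount fails the cast

# Load: write the cleaned records out (here, another in-memory file).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["user", "amount"])
writer.writeheader()
writer.writerows(clean)
print(len(clean))  # 2 valid records survive
```

In Pig itself, each stage would be a named relation in a data-flow script, which Pig then compiles into parallel jobs across the cluster.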
Schedule and Format
- Four noncredit courses
- On average, students take one course per semester (fall, spring, or summer)
- The Diploma must be completed within two years
The Diploma is awarded to students who successfully complete the four required courses within two years.
For additional information, or if you have any questions, please contact the Division of Programs in Business at email@example.com.
You'll Walk Away With
- The competencies required to compare relational databases with the databases used with Hadoop, including the volume of data each handles, how each structures data, and their retrieval techniques, as well as the knowledge of when each technology is appropriate for use
- The confidence to explain why a Hadoop system distributes data storage, processing, and retrieval across multiple computer systems, known as clusters or grids, and the value of that distribution
- The practical experience to explain how HBase, the open-source non-relational distributed database predominantly used with Hadoop, provides fault-tolerant, optimized storage of sparse data
- The ability to list and explain each of the optimization options used for HBase data, including block and record compression, Bloom filters, in_memory, max_length, and max_versions
- A portfolio of examples demonstrating your aptitude with each of the five steps the MapReduce programming model uses to process vast amounts of data in parallel across multiple computer systems
- The skills to demonstrate the use of the query languages Hive and Drill to create, load, and retrieve data in Hadoop storage structures, as well as to discuss when each is applicable
- The vocabulary and hands-on experience to describe the architecture of Spark and to demonstrate the use of Spark to load and inspect data using the interactive shell
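To give a feel for the MapReduce model referenced above, here is a minimal, pure-Python sketch of its phases, using the classic word-count exercise. It is illustrative only: a real Hadoop job runs these same phases in parallel across the nodes of a cluster rather than in one process.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # 1. Input: the data set arrives split into records (here, lines of text).
    # 2. Map: emit intermediate (key, value) pairs for each record.
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))
    # 3. Shuffle/sort: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # 4. Reduce: collapse each key's values into a single result.
    # 5. Output: collect the final (key, result) pairs.
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

# Word count, the "hello world" of MapReduce.
def mapper(line):
    return [(word, 1) for word in line.lower().split()]

def reducer(word, counts):
    return sum(counts)

lines = ["big data with Hadoop", "data flows with Hadoop"]
print(run_mapreduce(lines, mapper, reducer))
# {'big': 1, 'data': 2, 'flows': 1, 'hadoop': 2, 'with': 2}
```

Because the map and reduce functions each see only one record or one key at a time, the framework is free to distribute them across as many machines as the cluster provides.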
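Hive, Drill, Impala, and Spark SQL all accept SQL statements close to the ANSI standard. The query below illustrates the style; Python's built-in sqlite3 stands in for a cluster so the example is self-contained and runnable, and the table and column names are invented for illustration.

```python
import sqlite3

# sqlite3 is a stand-in for a Hadoop SQL engine; the SQL itself is
# close to what Hive, Drill, Impala, or Spark SQL would accept.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE server_logs (host TEXT, status INTEGER, bytes INTEGER)")
conn.executemany(
    "INSERT INTO server_logs VALUES (?, ?, ?)",
    [("web01", 200, 512), ("web01", 500, 0), ("web02", 200, 2048)],
)

# Aggregate traffic per host -- the kind of summary query these
# engines run over log data stored in HDFS.
rows = conn.execute(
    """SELECT host, COUNT(*) AS requests, SUM(bytes) AS total_bytes
       FROM server_logs
       GROUP BY host
       ORDER BY host"""
).fetchall()
print(rows)  # [('web01', 2, 512), ('web02', 1, 2048)]
```

The difference in practice is where the engines point: a Hive or Drill table maps onto files in HDFS rather than a local database file, so the same GROUP BY can span terabytes spread across a cluster.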