
Technology Diploma in Database Management with Hadoop

Industry analysts, such as The Data Warehousing Institute (TDWI), have indicated that Hadoop usage will become nearly universal among mainstream companies in the next few years for a variety of essential data applications. A TDWI survey found that the use of Hadoop systems increased 60 percent in the last two years alone. Hadoop and its ecosystem of tools offer advantages over traditional databases in their ability to analyze and manage enormous amounts of data, processing raw data rapidly without requiring extensive structuring of that data.

The Hadoop ecosystem has matured to the point that it can analyze and process both unstructured and structured data from multiple sources. Several SQL tools have emerged that allow the querying and analysis even of unstructured data, a capability that was not originally available, making Hadoop critical for successful data warehousing, data scalability, data analysis, and the handling of multiple data types.

The Diploma in Database Management with Hadoop provides a comprehensive introduction to big data; the Hadoop database; and the so-called Hadoop ecosystem of products used for data querying, management, analysis, and scripting, including Hive, Pig, Drill, Impala, and Spark. This immersive curriculum covers what big data is, introduces the Hadoop components, and describes examples of the uses of Hadoop and its management. Additionally, the Diploma includes a comparison of several Hadoop distributions, including the three market leaders: Cloudera, Hortonworks, and MapR.

Work Product

As a student in this program, you will be required to create a portfolio of assignments showcasing your understanding of, and ability to apply, the core concepts and competencies of the Diploma, including the tools in the Hadoop ecosystem. Additionally, you will complete a final capstone project during the fourth course, working in a team of two to four to complete a Hadoop project that includes the processing and analysis of data using the tools of the Hadoop ecosystem. Your team will present its project to the class and to invited industry professionals in a 15-minute recorded final presentation, followed by a five-minute Q&A, which assesses your ability to defend your work and gauges your formal presentation, public speaking, and communication skills. Once in the job market, you will be able to submit this work product to illustrate your knowledge of processing data using Hadoop.

Job Skills

  • The ability to use Spark’s interactive shell to load and inspect data and to build and launch a standalone Spark application (see the sketch after this list)
  • The skills to perform SQL queries using Hive, Drill, Impala, and Spark SQL against both structured and unstructured data
  • The ability to load data into a Pig relation and build a data flow using that data to illustrate the concepts of extracting, transforming, and loading (ETL) data
  • The skills to start Flume and transfer server log data into HDFS
  • The ability to use Ambari to monitor and manage a Hadoop cluster, as well as Microsoft’s HDInsight service on Azure
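
To make these skills concrete, here is a minimal sketch in PySpark (Python), not taken from the course materials, of the kind of work the bullets above describe: loading and inspecting a file in a Spark session, querying it with SQL much as you would a Hive table, and performing a small extract-transform-load step of the sort a Pig data flow expresses. The file name and column names (web_logs.csv, page, status, timestamp) are illustrative placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # In the interactive pyspark shell a SparkSession already exists as `spark`;
    # a standalone application builds its own, as here.
    spark = SparkSession.builder.appName("hadoop-diploma-sketch").getOrCreate()

    # Load a CSV file into a DataFrame (an hdfs:// path works the same way on a cluster).
    logs = spark.read.csv("web_logs.csv", header=True, inferSchema=True)

    # Inspect the data: schema, a sample of rows, and a row count.
    logs.printSchema()
    logs.show(5)
    print(logs.count())

    # Register the DataFrame as a temporary view and query it with Spark SQL,
    # much as you would query a Hive table.
    logs.createOrReplaceTempView("logs")
    spark.sql(
        "SELECT page, COUNT(*) AS hits FROM logs GROUP BY page ORDER BY hits DESC"
    ).show(10)

    # A small extract-transform-load step of the kind a Pig data flow describes:
    # filter rows, derive a column, and write the result back to storage.
    cleaned = (logs
               .filter(F.col("status") == 200)
               .withColumn("day", F.to_date("timestamp")))
    cleaned.write.mode("overwrite").parquet("cleaned_logs.parquet")

    spark.stop()

The same code, saved as a file, can be launched as a standalone application with spark-submit rather than typed into the shell. (Pig itself expresses such data flows in its own scripting language, Pig Latin.)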

 

Schedule and Format

  • Part-time
  • Classroom-based
  • Four noncredit courses
  • On average, students take one course per semester (fall, spring, or summer)
  • The Diploma must be completed within two years

 

The Diploma is awarded to students who successfully complete the four required courses within two years.

For additional information, or if you have any questions, please contact the Division of Programs in Business at diploma.dpb@nyu.edu.


You'll Walk Away With

  • The competencies required to compare relational databases with the databases used with Hadoop, including data volume, data structuring, and retrieval techniques, as well as the knowledge of when each technology is appropriate for use
  • The confidence to explain the reasons for, and value of, distributing data storage, processing, and retrieval across multiple computer systems, known in a Hadoop system as clusters or grids
  • The practical experience to explain how the open-source, non-relational, distributed database HBase is predominantly used with Hadoop, and the ability to show how it provides fault-tolerant, optimized storage of sparse data
  • The ability to list and explain each of the optimization options used for HBase data, including block and record compression, Bloom filters, in_memory, max_length, and max_versions
  • A portfolio of examples demonstrating your aptitude with each of the five steps that the map-reduce model of programming uses to process vast amounts of data in parallel on multiple computer systems (a toy illustration follows this list)
  • The skills to demonstrate the use of the query languages Hive and Drill to create, load, and retrieve data in Hadoop storage structures, as well as to discuss when each would be applicable
  • The vocabulary and practical ability to describe the architecture of Spark and to demonstrate the use of the interactive shell to load and inspect data
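
As a toy, single-machine illustration of those map-reduce steps, the Python sketch below runs the split, map, shuffle/sort, reduce, and output phases in sequence; Hadoop performs the same phases, but distributed across a cluster and reading from and writing to HDFS. The sample documents are placeholders.

    from collections import defaultdict

    # Input/split: in Hadoop each file block would become a separate input split.
    documents = ["big data with hadoop", "hadoop and spark", "big data analysis"]

    # Map: emit a (key, value) pair for every word in every split.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle/sort: group together all values that share the same key.
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)

    # Reduce: combine each key's grouped values into a single result.
    word_counts = {word: sum(counts) for word, counts in groups.items()}

    # Output: Hadoop would write these results back to HDFS.
    print(word_counts)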

Program Curriculum

COURSES THAT GIVE YOU THE SKILLS AND TRAINING YOU NEED TO START YOUR NEW CAREER

REQUIRED

All Courses Required

Introduction to Big Data and the Hadoop Ecosystem

Gain an introduction to big data; the Hadoop database; and the Hadoop ecosystem of products used for querying, analyzing, and scripting work.

2018 Spring
1 section

Processing and Data Retrieval in a Hadoop and Spark Environment

Learn to characterize Hive, Drill, Impala, and JAQL-like query languages; load and inspect data in Apache Spark; and create a Spark application.

2018 Spring
1 section

Data Analysis and Machine Learning

Study several SQL interfaces used to perform data exploration and statistical functions in order to return summary information and advanced analysis.

Hadoop Management and the Capstone Project

Gain an understanding of and facility with Hadoop management tools, such as Apache Ambari and Cloudera Manager, and complete a capstone project.

OPTIONAL


JobFocus: Data Processing with Hadoop

Benefit from a better understanding of the job market and of employer expectations for positions in data processing with Hadoop.

General Admission Requirements

  • Resume
  • Two references
  • Essay
  • A bachelor’s degree or four years of professional experience
  • A satisfactory scholastic average [a minimum grade-point average (GPA) of 2.5 (C+) on a 4.0 scale]
