Hadoop in the Cloud

UKCloud’s Hadoop in the Cloud is a Platform as a Service (PaaS) implementation of the Hadoop data platform, built upon our secure, UK Government accredited Infrastructure as a Service (IaaS) environments.

As the volume, velocity and variety of the data that organisations generate continue to grow, they need to carefully balance the cost of retaining this data against the opportunity to exploit the potential value of Big Data. Hadoop offers economical data warehouse capabilities in which all data is stored and accessible, along with Massively Parallel Processing (MPP) that enables various methods of analysing and interrogating the data.

Our solution enables organisations to rapidly deploy, experiment and prove the value of Hadoop-based solutions without having to invest in the cost, time and risk associated with purchasing and provisioning infrastructure, platforms and licenses.

G-Cloud Service ID number: 7927 4847 9173 374 (Assured) 6818 2775 5080 856 (Elevated)

OVERVIEW

We believe in the massive potential that data-driven decisions have in improving public sector outcomes. There are significant opportunities to connect and federate datasets, and to use emerging analytic solutions to derive new insight. But these analytic solutions require a significant platform. Our Hadoop in the Cloud service provides that platform at the highest levels of performance and security, without sacrificing the commercial benefits or flexibility of the cloud. You have a choice of Hortonworks or Cloudera-certified Hadoop clusters, each deployed on optimised multi-tenant infrastructure, with the following single-tenant elements:

  • Single-tenant cluster. Each customer is provisioned with their own isolated Hadoop environment - virtual master nodes, virtual slave nodes and virtual cluster managers aren’t shared with any other tenants.
  • Dedicated 1:1 mapping of slave nodes to directly attached physical drives helps increase overall HDFS performance, scalability and security.

FEATURES/BENEFITS

Feature | Benefit
Petabyte-scale HDFS storage to power many Big Data analytics solutions | Reduces the complexity of managing secure data
Optimised infrastructure engineered specifically for Hadoop | Rapid one-click provisioning of a Big Data solution
Hadoop v2 supporting YARN analytics functionality | Reduces the cost traditionally associated with Big Data
Data integrity and enhanced availability through Hadoop’s built-in replication | Increases innovation and exploration of Big Data opportunities
Built-in graphical management tools that simplify cluster management and administration | Utility consumption: scale resources as and when required
Extensible analytics: HBase, Hive, Pig, Impala, Spark, Storm | Ready to use immediately once deployed, with minimal configuration
Extensible data ingest: Sqoop, Flume, Kafka | Storage of heterogeneous data in a single data store
Choice of certified Cloudera Enterprise Manager or Hortonworks HDP distributions | Reduces data replication and encourages data recycling
Elastic Analytic modules, providing greater commercial and technical flexibility | Converged data analytics and intelligence increases value from data
Optimised for OFFICIAL, designed for OFFICIAL and OFFICIAL-SENSITIVE data | Specific to the data protection needs of UK public sector

SERVICE INFORMATION

Hadoop in the Cloud service information:

  • Choice of the most popular 100% open-source Hadoop distributions - Cloudera Enterprise or Hortonworks HDP
  • A true on-demand OPEX-based service which eliminates traditional CAPEX challenges
  • A fully configured deployment which helps reduce the level of expertise needed to get started with the fundamentals of Hadoop
  • Removal of the capacity management burden traditionally associated with Hadoop
  • An extensible solution which enables you to use any Hadoop-compatible business intelligence, analytics and reporting applications
  • Uncontended resources reduce any adverse performance issues as a result of noisy neighbours
  • Proximity of Hadoop data to compute resources increases processing performance and removes data transfer charges between compute and storage
  • An unrivalled big data and analytics partner ecosystem

TECHNICAL SPECIFICATIONS

Hadoop-in-the-Cloud is powered by VMware vSphere or OpenStack for the underlying assured IaaS service, with customers choosing Cloudera Enterprise or Hortonworks® HDP deployed on top as the enabling Hadoop distribution.

The platform consists of the following modules and associated supporting services:

  • Hadoop Common: the common utilities that support the other Hadoop modules
  • Hadoop Distributed File System (HDFS™): a distributed file system that provides high-throughput access to application data
  • Hadoop YARN: a framework for job scheduling and cluster resource management
  • Hadoop MapReduce v2: a YARN-based system for parallel processing of large datasets
  • Apache ZooKeeper: a distributed coordination service

The Hadoop core platform facilitates data ingress and egress along with native MapReduce v2 applications. In addition, we provide access to Hadoop cluster management via API or GUI (depending on the chosen distribution - Cloudera Manager or Ambari), for you to use to manage and maintain your entire Hadoop cluster.
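As a rough illustration of API-driven cluster management, the sketch below builds an authenticated request that lists clusters through the Ambari or Cloudera Manager REST API. The host name, credentials and the Cloudera Manager API version (`v19`) are placeholders chosen for the example, and the ports (8080 for Ambari, 7180 for Cloudera Manager) are the products' usual defaults, which may differ in a given deployment.

```python
import base64
import urllib.request

# Usual REST roots for each manager. The ports and the Cloudera Manager
# API version are assumptions; check the values for your own cluster.
API_ROOTS = {
    "ambari": "http://{host}:8080/api/v1/clusters",
    "cloudera": "http://{host}:7180/api/v19/clusters",
}


def build_cluster_request(manager, host, user, password):
    """Return a urllib Request that lists the clusters known to the manager."""
    url = API_ROOTS[manager].format(host=host)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    if manager == "ambari":
        # Ambari expects this header on modifying calls; it is harmless on reads.
        request.add_header("X-Requested-By", "ambari")
    return request


if __name__ == "__main__":
    # Hypothetical host and credentials, for illustration only.
    req = build_cluster_request("ambari", "manager.example.internal", "admin", "secret")
    print(req.full_url)
    # To send it: urllib.request.urlopen(req).read()
```

Keeping request construction separate from sending makes it easy to check the URL and headers before pointing the code at a live cluster manager.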

We monitor all infrastructure and platform components related to the embedded Hadoop core platform as defined above, and provide ‘reasonable endeavours’ support for all associated services that interface with your analytics solution.

We provide a specific virtual data centre (VDC) alongside the Hadoop cluster to enable you to deploy your own analytics, business intelligence and reporting applications. We charge for and support any VMs within this specific VDC in line with our Enterprise Compute Cloud or Cloud Native Infrastructure service definitions. For clarity, UKCloud isn’t responsible for managing any image, software or service deployed within these VMs as we give you complete autonomy to achieve your desired outcome.

Figure 2. Hadoop in the Cloud overview

This service is extensible, as shown in Figure 2, allowing additional analytic applications to be deployed to facilitate functions beyond data ingest, egress and MapReduce v2 jobs. These depend on the choice of distribution and may include:

  • Cloudera Impala - a massively parallel processing (MPP) SQL query engine
  • Cloudera Search - for non-technical users to search and explore data
  • Apache™ Spark - an in-memory data processing engine
  • Apache™ Falcon - for data management and pipeline processing
  • Apache™ Kafka - a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system

Management and maintenance of any additional applications deployed by you, beyond the initial deployment, remain the responsibility of you or your chosen partners.

SERVICE OPTIONS

UKCloud’s Hadoop in the Cloud lets you scale up HDFS storage resources as required, yet retain elasticity around analytics requirements. You specify your solution according to the storage (TiB) required.

Additional Service Options

Global Load Balancing

  • Enables you to direct internet traffic across multiple end-points (which can be across different sites or different platforms)
  • UKCloud subcontracts this service to Neustar, global leaders in network and security services
  • Requires a 12-month minimum term commitment

Application-tuned DDoS protection

  • A domain-based service that can be finely tuned to the profile of specific applications and workloads
  • Complements UKCloud’s platform-level DDoS protection
  • UKCloud subcontracts this service to Neustar, global leaders in network and security services
  • Requires a 12-month minimum term commitment

FAQs

What is the service?

Hadoop in the Cloud is UKCloud’s highly secure PaaS implementation of Hadoop. It provides a cloud-based solution to help organisations address the challenges of big data storage and processing.

Why deliver Hadoop as a cloud service?

The service enables organisations to explore a highly connected, secure, stable solution that’s optimised for big data, from proof of concept through to production workloads — while minimising the investment, time and risk associated with buying, provisioning, configuring and maintaining Hadoop infrastructure, platforms and licenses.

How is Hadoop in the Cloud billed?

The service is a true cloud service, billed by the hour based on the storage consumed, with no upfront cost, minimum commitment or early exit fees.

Does UKCloud offer a free trial?

We offer a 30-day free trial so that you can test and evaluate our service without commitment. Your trial provides you with a live environment on the UKCloud platform to test our services and verify whether they are suited to your needs.

Where is the service hosted?

The service is delivered by a UK company from two tier 3 UK data centres separated by more than 100km, and securely connected by high-bandwidth, low-latency dedicated connectivity.

Does my data leave the United Kingdom?

As the service is delivered from UK data centres by a UK company, the data does not leave the UK when at rest.

How is Hadoop in the Cloud supported?

UKCloud manage and support the Hadoop Core Platform using our dedicated support team based in the UK. Support is available via helpdesk ticket or phone.

What modules constitute the Hadoop Core Platform?

The Hadoop core platform consists of the following modules and associated supporting services:

  • Hadoop Common. The common utilities that support the other Hadoop modules
  • Hadoop Distributed File System (HDFS™). A distributed file system that provides high-throughput access to application data
  • Hadoop YARN. A framework for job scheduling and cluster resource management
  • Hadoop MapReduce v2. A YARN-based system for parallel processing of large datasets

The modules in the Hadoop core platform facilitate data ingress and egress, and native MapReduce v2 applications.
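To make the MapReduce model concrete, here is a minimal word-count sketch in plain Python that imitates the map, shuffle and reduce phases within a single process. It is illustrative only: it does not use the Hadoop APIs, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters"]))
```

On a real cluster the map and reduce functions run in parallel across worker nodes, and the framework performs the shuffle over the network.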

Can I use Hadoop in the Cloud in the UKCloud Elevated (previously IL3) domain?

Yes, Hadoop in the Cloud is available in both the OFFICIAL Assured and Elevated domains.

Is the service Pan Government Accredited?

UKCloud’s existing PGA still applies to the infrastructure underpinning our services, but since the move to the Government Security Classification Policy (GSCP), we are no longer able to seek PGA for new services such as Hadoop in the Cloud.

We are now required to self-assert our services, with customers then responsible for assessing and selecting the most appropriate cloud services which meet their individual security requirements.

To give confidence that the service still meets the highest level of information assurance, we continue to conduct independent testing and validation of our platform and make the findings available to our customers and partners. This enables their SIROs to make an informed decision about self-asserting any service they choose to consume.

What Hadoop Distributions will ‘Hadoop in the Cloud’ support?

This service currently supports Hortonworks® HDP and Cloudera Enterprise. We will continue to review supporting additional distributions according to market demand.

How did UKCloud define its Hadoop Core Platform?

To deliver a quality service, we identified the boundaries of Hadoop in order to make a clear delineation between UKCloud-provided and -supported services, and customer/third-party services. We’ve adopted the industry definition of Hadoop as per the Apache Software Foundation: http://hadoop.apache.org/

Can I use Hadoop in the Cloud over closed networks such as PSN or N3?

The service is accredited for use over PSN. Connectivity to the N3 network will be considered when an appropriate sponsor submits a requirement.

What is the underlying storage technology for the service?

We designed our platform to be optimised specifically for Hadoop, in line with best practices established by VMware and the Hadoop community.

Unlike some Hadoop cloud service providers, we give each node VM exclusive access to a physical drive attached directly to the host, helping to increase both performance and security.

How do you ensure my data remains secure in a multi-tenant environment?

Hadoop in the Cloud was designed with data security as a priority. Each Hadoop cluster is deployed as its own entity and within its own virtualised environment from a storage, processing and management perspective. This, coupled with all HDFS data being stored on a physical drive exclusive to a single tenant’s virtual node, helps ensure the highest level of data security and assurance.

Will UKCloud manage rolling point Hadoop releases?

We will monitor the release of any minor, major and security patch releases, and test them on our own platforms. We won’t automatically apply updates, but will present our testing, update packages and blueprints to enable customers to apply patches at their own discretion.

What is the HDFS data replication factor for Hadoop in the Cloud?

UKCloud has fixed the HDFS data replication factor at three, so each block of data is stored three times. This factor is in line with established Hadoop practices, and helps keep costs for the service to a minimum.
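A quick worked example shows what three-way replication means for capacity. This sketch assumes that every HDFS block is stored three times and ignores other overheads such as temporary job space. Whether the TiB figure on an order is expressed as raw or usable capacity is not stated here, so treat the numbers as illustrative.

```python
REPLICATION_FACTOR = 3  # assumption: each HDFS block is stored three times


def raw_capacity_needed(usable_tib, replication=REPLICATION_FACTOR):
    """Raw disk (TiB) required to hold `usable_tib` of data."""
    return usable_tib * replication


def usable_capacity(raw_tib, replication=REPLICATION_FACTOR):
    """Data (TiB) that fits on `raw_tib` of raw disk."""
    return raw_tib / replication


def failures_tolerated(replication=REPLICATION_FACTOR):
    """Copies of a block that can be lost while one copy still survives."""
    return replication - 1


if __name__ == "__main__":
    print(raw_capacity_needed(10))  # 30
    print(usable_capacity(30))      # 10.0
    print(failures_tolerated())     # 2
```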

How large can I grow my Hadoop cluster?

UKCloud is confident that our Hadoop in the Cloud service is capable of operating at a scale more than large enough to deal with the majority of Hadoop use cases and production workloads.

Does UKCloud offer any scheduled automated backup for Hadoop in the Cloud?

There is no scheduled automated backup for this service as Hadoop’s storage engine, HDFS, is engineered with infrastructure failure in mind. That means localised component failures are tolerated within the infrastructure via data replication, eliminating single points of failure (including physical host failure or disk failure).

Hadoop v2.4.1+ allows for manual creation of snapshots of HDFS, which can be stored offline using our Cloud Storage.
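As a minimal sketch of taking a manual snapshot, the helper below assembles the standard `hdfs dfsadmin -allowSnapshot` and `hdfs dfs -createSnapshot` commands. The directory and snapshot name are hypothetical, and copying the snapshot to offline storage is deliberately left out because the target location depends on your own setup.

```python
import subprocess


def snapshot_commands(directory, name):
    """Return the hdfs CLI calls that enable and then take a snapshot."""
    return [
        # One-off step, run as an HDFS superuser: mark the directory snapshottable.
        ["hdfs", "dfsadmin", "-allowSnapshot", directory],
        # Create a named, read-only point-in-time snapshot.
        ["hdfs", "dfs", "-createSnapshot", directory, name],
    ]


def snapshot_path(directory, name):
    """Path under which the snapshot's contents can be read back."""
    return f"{directory.rstrip('/')}/.snapshot/{name}"


def take_snapshot(directory, name):
    """Run the commands; requires the hdfs client on PATH."""
    for command in snapshot_commands(directory, name):
        subprocess.run(command, check=True)
    return snapshot_path(directory, name)


if __name__ == "__main__":
    # Hypothetical directory and snapshot name, printed rather than executed.
    for command in snapshot_commands("/data/landing", "manual-snapshot-1"):
        print(" ".join(command))
```

The snapshot appears under the directory's `.snapshot` subdirectory, from where its contents can be copied elsewhere with your preferred tool.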

Is your Hadoop in the Cloud service extensible to offer additional analytics and visualisation tools?

We have engineered the service to enable customers to provision their own analytics, business intelligence and visualisation tools on our Compute service line with full, reduced-latency connectivity to Hadoop in the Cloud.

Does Hadoop in the Cloud support active/active replication of my cluster between your two data centres?

Currently our service offers only a single active cluster from either of our data centres.

Active/passive clusters could be configured using our low-latency dedicated connectivity to enable synchronous replication, but the customer or partner would be responsible for supporting this configuration.

Third-party tools for active/active Hadoop clusters are available, but we would not be responsible for the design, implementation, testing or support of these tools.

What are Velocity Packs?

To maximise cost flexibility against user requirements, UKCloud offers three Hadoop cluster types for customers to choose from, based on their initial Hadoop data requirements, coupled with their projected velocity of future data ingest.

Can I mix and match different Velocity Packs?

It is currently not possible to mix and match cluster node types within a single cluster (for example, a low-velocity cluster can only scale out with low-velocity slave nodes).

How is Hadoop in the Cloud supported?

We manage and support the Hadoop core platform using our dedicated support team based in the UK. Support is available via helpdesk ticket, phone or email.

Will my cluster performance increase as I deploy more worker nodes?

Owing to the way Hadoop places and queries data, the more worker nodes the cluster can spread its data across, the faster performance becomes.

How do you ensure the performance and resilience of Hadoop in a virtualised environment?

We’ve used VMware Big Data Extensions (BDE) and Hadoop Virtualization Extensions (HVE) to create rack, host and node awareness within our virtual data centre, to help ensure the best placement of nodes from a performance and resilience perspective.
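Rack and host awareness in Hadoop is generally driven by a topology mapping that tells the NameNode where each worker sits. The sketch below shows the general shape of a topology script (the kind configured through `net.topology.script.file.name`) that returns a rack and node-group path for each worker address. The addresses, rack names and host names are hypothetical, and the example illustrates the general mechanism rather than this platform's internal configuration.

```python
import sys

# Hypothetical mapping of worker VM address -> (rack, physical host).
# In a real cluster this would come from an inventory, not be hard-coded.
PLACEMENT = {
    "10.0.0.11": ("rack1", "host-a"),
    "10.0.0.12": ("rack1", "host-b"),
    "10.0.0.21": ("rack2", "host-c"),
}
DEFAULT_LOCATION = "/default-rack/default-nodegroup"


def resolve(address):
    """Return the /rack/nodegroup path for one worker address."""
    if address not in PLACEMENT:
        return DEFAULT_LOCATION
    rack, host = PLACEMENT[address]
    return f"/{rack}/{host}"


if __name__ == "__main__":
    # Hadoop passes one or more addresses and reads back one path per address.
    print(" ".join(resolve(arg) for arg in sys.argv[1:]))
```

With a mapping like this, replicas of a block can be placed on different physical hosts and racks, so that losing one host does not remove every copy.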