Hadoop in the Cloud
Built upon our UK Government accredited Infrastructure as a Service (IaaS) environments
We believe in the massive potential that data-driven decisions have in improving public sector outcomes. There are significant opportunities to connect and federate datasets, and to use emerging analytic solutions to derive new insight. But these analytic solutions require a significant platform. Our Hadoop in the Cloud service provides that platform at the highest levels of performance and security, without sacrificing the commercial benefits or flexibility of the cloud. You have a choice of Hortonworks or Cloudera-certified Hadoop clusters, each deployed on optimised multi-tenant infrastructure, with the following single-tenant elements:
- Single-tenant cluster. Each customer is provisioned with their own isolated Hadoop environment – virtual master nodes, virtual slave nodes and virtual cluster managers aren’t shared with any other tenants.
- Dedicated 1:1 mapping of slave nodes to directly attached physical drives helps increase overall HDFS performance, scalability and security.
| Features | Benefits |
| --- | --- |
| Petabyte-scale HDFS storage to power many Big Data analytics solutions | Reduces the complexity of managing secure data |
| Optimised infrastructure engineered specifically for Hadoop | Rapid one-click provisioning of a Big Data solution |
| Hadoop v2 supporting YARN analytics functionality | Reduces costs traditionally associated with Big Data |
| Data integrity and enhanced availability through Hadoop’s built-in replication | Increases innovation and exploration of Big Data opportunities |
| Built-in graphical management tools simplify cluster management and administration | Utility consumption – scale resources as and when required |
| Extensible analytics: HBase, Hive, Pig, Impala, Spark, Storm | Ready to use immediately once deployed, with minimal configuration |
| Extensible data ingest: Sqoop, Flume, Kafka | Storage of heterogeneous data in a single data store |
| Choice of certified Cloudera Enterprise or Hortonworks HDP distributions | Reduces data replication and encourages data reuse |
| Elastic analytic modules, providing greater commercial and technical flexibility | Converged data analytics and intelligence increases value from data |
| Optimised for OFFICIAL, designed for OFFICIAL and OFFICIAL-SENSITIVE data | Specific to the data protection needs of the UK public sector |
Hadoop in the Cloud service information:
- Choice of the most popular 100% open-source Hadoop distributions – Cloudera Enterprise or Hortonworks HDP
- A true on-demand OPEX-based service which eliminates traditional CAPEX challenges
- A fully configured deployment which helps reduce the level of expertise needed to get started with the fundamentals of Hadoop
- Removal of the capacity management burden traditionally associated with Hadoop
- An extensible solution which enables you to use any Hadoop-compatible business intelligence, analytics and reporting applications
- Uncontended resources reduce adverse performance issues caused by noisy neighbours
- Proximity of Hadoop data to compute resources increases processing performance and removes data transfer charges between compute and storage
- An unrivalled big data and analytics partner ecosystem
Hadoop in the Cloud is powered by VMware vSphere or OpenStack for the underlying assured IaaS service, with customers choosing Cloudera Enterprise or Hortonworks® HDP deployed on top as the enabling Hadoop distribution.
The platform consists of the following modules and associated supporting services:
- Hadoop Common: the common utilities that support the other Hadoop modules
- Hadoop Distributed File System (HDFS™): a distributed file system that provides high-throughput access to application data
- Hadoop YARN: a framework for job scheduling and cluster resource management
- Hadoop MapReduce v2: a YARN-based system for parallel processing of large datasets
- Hadoop ZooKeeper: a distributed coordination service for the cluster
The Hadoop core platform facilitates data ingress and egress along with native MapReduce v2 applications. In addition, we provide access to Hadoop cluster management via API or GUI (depending on the chosen distribution – Cloudera Manager or Ambari), for you to use to manage and maintain your entire Hadoop cluster.
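As an illustration of the MapReduce programming model that the core platform supports, the sketch below shows the classic word count in plain Python, with explicit map, shuffle and reduce phases. This is illustrative only: a real MapReduce v2 job would be written against the Hadoop MapReduce API and submitted to YARN, not run locally like this.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def shuffle_phase(pairs):
    # Shuffle: group intermediate values by key
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data big insight", "big platform"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # → {'big': 3, 'data': 1, 'insight': 1, 'platform': 1}
```

In a Hadoop cluster the shuffle phase is performed by the framework itself, which is what lets the same map and reduce logic scale across many nodes without change.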
We monitor all infrastructure and platform components related to the embedded Hadoop core platform as defined above, and provide ‘reasonable endeavours’ support for all associated services that interface with your analytics solution.
We provide a specific virtual data centre (VDC) alongside the Hadoop cluster to enable you to deploy your own analytics, business intelligence and reporting applications. We charge for and support any VMs within this specific VDC in line with our Enterprise Compute Cloud or Cloud Native Infrastructure service definitions. For clarity, UKCloud isn’t responsible for managing any image, software or service deployed within these VMs as we give you complete autonomy to achieve your desired outcome.
Figure 2. Hadoop in the Cloud overview
This service is extensible, as shown in Figure 2, allowing additional analytic applications to be deployed to facilitate functions beyond data ingest, egress and MapReduce v2 jobs. These depend on the choice of distribution and may include:
- Cloudera Impala – a massively parallel processing (MPP) SQL query engine
- Cloudera Search – for non-technical users to search and explore data
- Apache™ Spark – an in-memory data processing engine
- Apache™ Falcon – for data management and pipeline processing
- Apache™ Kafka – a fast, scalable, durable and fault-tolerant publish-subscribe messaging system
Management and maintenance of any additional applications you deploy beyond the initial deployment remain your responsibility, or that of your chosen partners.
UKCloud’s Hadoop in the Cloud lets you scale up HDFS storage resources as required, yet retain elasticity around analytics requirements. You specify your solution according to the storage (TiB) required.
Additional Service Options
Global Load Balancing
- Enables you to direct internet traffic across multiple end-points (which can be across different sites or different platforms)
- UKCloud subcontracts this service to Neustar, global leaders in network and security services
- Requires a 12-month minimum term commitment
Application-tuned DDoS protection
- A domain-based service that can be finely tuned to the profile of specific applications and workloads
- Complements UKCloud’s platform-level DDoS protection
- UKCloud subcontracts this service to Neustar, global leaders in network and security services
- Requires a 12-month minimum term commitment
Q What is the service?
Hadoop in the Cloud is UKCloud’s highly secure PaaS implementation of Hadoop. It provides a cloud-based solution to help organisations address the challenges of big data storage and processing.
Q Why deliver Hadoop as a cloud service?
The service enables organisations to explore a highly connected, secure, stable solution that’s optimised for big data, from proof of concept through to production workloads — while minimising the investment, time and risk associated with buying, provisioning, configuring and maintaining Hadoop infrastructure, platforms and licenses.
Q What modules constitute the Hadoop core platform?
The Hadoop core platform consists of the following modules and associated supporting services:
- Hadoop Common. The common utilities that support the other Hadoop modules
- Hadoop Distributed File System (HDFS™). A distributed file system that provides high-throughput access to application data
- Hadoop YARN. A framework for job scheduling and cluster resource management
- Hadoop MapReduce v2. A YARN-based system for parallel processing of large datasets
The modules in the Hadoop core platform facilitate data ingress and egress, and native MapReduce v2 applications.
Q How did UKCloud define its Hadoop core platform?
To deliver a quality service, we identified the boundaries of Hadoop in order to draw a clear delineation between UKCloud-provided and -supported services and customer or third-party services. We’ve adopted the industry definition of Hadoop as per the Apache Foundation: http://hadoop.apache.org/
Q What is the HDFS data replication factor for Hadoop in the Cloud?
UKCloud has fixed the HDFS data replication factor to a multiple of three. This factor is in line with established Hadoop practices, and helps keep costs for the service to a minimum.
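The cost implication of the replication factor can be sketched with a little arithmetic. The sketch below assumes a factor of exactly three, in line with the answer above; the function name is illustrative, not part of the service.

```python
REPLICATION_FACTOR = 3  # fixed by the service, per the answer above

def raw_storage_needed(usable_tib: float) -> float:
    # Every HDFS block is stored three times, so the raw capacity
    # backing the cluster is three times the usable capacity.
    return usable_tib * REPLICATION_FACTOR

print(raw_storage_needed(10))  # 10 TiB of data occupies 30 TiB of raw disk
```

This is why HDFS needs no separate RAID or backup layer for block-level resilience: the replicas themselves absorb individual disk or host failures.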
Q How large can I grow my Hadoop cluster?
UKCloud is confident that our Hadoop in the Cloud service is capable of operating at a scale more than large enough to deal with the majority of Hadoop use cases and production workloads.
Q What Hadoop distributions does Hadoop in the Cloud support?
This service currently supports Hortonworks® HDP and Cloudera Enterprise. We will continue to review supporting additional distributions according to market demand.
Q Is your Hadoop in the Cloud service extensible to offer additional analytics and visualisation tools?
We have engineered the service to enable customers to provision their own analytics, business intelligence and visualisation tools on our Compute service line with full, reduced-latency connectivity to Hadoop in the Cloud.
Q How is Hadoop in the Cloud billed?
The service is a true cloud service, billed by the hour based on the storage consumed, with no upfront cost, minimum commitment or early exit fees.
Q Does UKCloud offer a free trial?
We offer a 30-day free trial so that you can test and evaluate our service without commitment. Your trial provides you with a live environment on the UKCloud platform to test our services and verify whether they are suited to your needs.
Q What are velocity packs?
To maximise cost flexibility against user requirements, UKCloud offers three Hadoop cluster types for customers to choose from, based on their initial Hadoop data requirements, coupled with their projected velocity of future data ingest.
Q Can I mix and match different velocity packs?
It is currently not possible to mix and match cluster node types within a single cluster (for example, a low-velocity cluster can only scale out with low-velocity slave nodes).
Q Where is the service hosted?
The service is delivered by a UK company from two tier 3 UK data centres separated by more than 100km, and securely connected by high-bandwidth, low-latency dedicated connectivity.
Q Does my data leave the UK?
As the service is delivered from UK data centres by a UK company, your data does not leave the UK when at rest.
Q Can I use Hadoop in the Cloud in the UKCloud Elevated (previously IL3) domain?
Yes, Hadoop in the Cloud is available in both the OFFICIAL Assured and Elevated domains.
Q How do you ensure my data remains secure in a multi-tenant environment?
Hadoop in the Cloud was designed with data security as a priority. Each Hadoop cluster is deployed as its own entity and within its own virtualised environment from a storage, processing and management perspective. This, coupled with all HDFS data being stored on a physical drive exclusive to a single tenant’s virtual node, helps ensure the highest level of data security and assurance.
Q Is the service Pan Government Accredited?
UKCloud’s existing PGA still applies to the infrastructure underpinning our services, but since the move to the Government Security Classification Policy (GSCP), we are no longer able to seek PGA for new services such as Hadoop in the Cloud.
We are now required to self-assert our services, with customers then responsible for assessing and selecting the most appropriate cloud services which meet their individual security requirements.
We provide confidence that the service still meets the highest level of information assurance, which is why we continue to conduct independent testing and validation of our platform, and have the findings made available to our customers and partners, thereby enabling their SIROs to make an informed decision about self-asserting any service they choose to consume.
Q How is Hadoop in the Cloud supported?
We manage and support the Hadoop core platform using our dedicated support team based in the UK. Support is available via helpdesk ticket, phone or email.
Q Will UKCloud manage rolling point Hadoop releases?
We will monitor the release of any minor, major and security patch releases, and test them on our own platforms. We won’t automatically apply updates, but will present our testing, update packages and blueprints to enable customers to apply patches at their own discretion.
Q Can I use Hadoop in the Cloud over closed networks such as PSN and N3/HSCN?
The service is accredited for use over PSN. Connectivity to the N3/HSCN network will be considered when an appropriate sponsor submits a requirement.
DATA PERFORMANCE AND RESILIENCE
Q What is the underlying storage technology for the service?
We designed our platform to be optimised specifically for Hadoop, in line with best practices established by VMware and the Hadoop community.
Unlike some Hadoop cloud service providers, we give each node VM exclusive access to a physical drive attached directly to the host, helping to increase both performance and security.
Q How do you minimise the traditional cloud data access latency for Hadoop?
The use of localised, single-tenant physical drives directly attached to a virtual node overcomes the traditional cloud data access latency issues that occur when a virtual machine has to pull data from a SAN across the network.
Q How do you ensure the performance and resilience of Hadoop in a virtualised environment?
We’ve used VMware Big Data Extensions and Hadoop Virtualization Extensions to create rack, host and node awareness within our virtual data centre, helping to ensure the best placement of nodes from a performance and resilience perspective.
Q Will my cluster performance increase, the more worker nodes I deploy?
Owing to the way Hadoop places and queries data, the more worker nodes the cluster can spread its data across, the faster performance becomes.
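A toy model illustrates this scaling behaviour. The per-node throughput and block size below are arbitrary illustrative assumptions, not service figures; real clusters are also bound by block count, data skew and network contention.

```python
def scan_time_seconds(data_gib, worker_nodes,
                      per_node_gib_per_s=0.1, block_size_gib=0.125):
    # HDFS splits the data into blocks and spreads them across workers,
    # so a full scan proceeds in parallel on every node holding data.
    blocks = max(1, int(data_gib / block_size_gib))
    parallelism = min(worker_nodes, blocks)
    return data_gib / (per_node_gib_per_s * parallelism)

# Doubling the workers (up to the block count) roughly halves the scan time.
print(scan_time_seconds(100, 4))  # 250.0
print(scan_time_seconds(100, 8))  # 125.0
```

Once the worker count exceeds the number of blocks, adding further nodes no longer shortens a single scan, which is why scaling benefits depend on dataset size as well as node count.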
Q Why is my cluster or newly added node faster after it has been running for a while?
When a cluster or node is initially deployed, the disks may take some time to warm up. This is because the disks start out blank and there is an initial write penalty. Once the disks have been written to, performance improves.
Q Does UKCloud offer any scheduled automated backup for Hadoop in the Cloud?
There is no scheduled automated backup for this service as Hadoop’s storage engine, HDFS, is engineered with infrastructure failure in mind. That means localised component failures are tolerated within the infrastructure via data replication, eliminating single points of failure (including physical host failure or disk failure).
Hadoop v2.4.1+ allows for manual creation of snapshots of HDFS, which can be stored offline using our Cloud Storage.
Q Does Hadoop in the Cloud support active/active replication of my cluster between your two data centres?
Currently our service offers only a single active cluster from either of our data centres.
Active/passive clusters could be configured using our low-latency dedicated connectivity to enable synchronous replication, but the customer or partner would be responsible for supporting this configuration.
Third-party tools for active/active Hadoop clusters are available, but we would not be responsible for the design, implementation, testing or support of these tools.