Contextual search for datasets in CDAP

The landing page for CDAP metadata search

When you’re in charge of managing large amounts of data, the data about that data (its metadata) can be just as important as the data itself.

Imagine, for instance, sifting through a large cloud drive for a paper you wrote around two summers ago. You might recall that you wrote it between June 2017 and August 2017, and that the title had something to do with grape vineyards. So, you search within those date boundaries, and for the keywords “grape” and “vineyard.” …
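The cloud-drive example above can be sketched as a small metadata filter. This is only an illustration of the idea, not CDAP's search API: the `FileMetadata` record, its field names, the `search` helper, and the sample files are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FileMetadata:
    # Hypothetical record; the field names are illustrative, not a CDAP schema.
    title: str
    created: date


def search(files, start, end, keywords):
    """Return files created within [start, end] whose title mentions every keyword."""
    wanted = [k.lower() for k in keywords]
    return [
        f for f in files
        if start <= f.created <= end
        and all(k in f.title.lower() for k in wanted)
    ]


# Made-up sample data standing in for the contents of a cloud drive.
drive = [
    FileMetadata("Irrigation in Grape Vineyards", date(2017, 7, 14)),
    FileMetadata("Grape Harvest Notes", date(2016, 9, 2)),
    FileMetadata("Quarterly Budget", date(2017, 6, 30)),
]

# Search between June 2017 and August 2017 for "grape" and "vineyard".
matches = search(drive, date(2017, 6, 1), date(2017, 8, 31), ["grape", "vineyard"])
print([f.title for f in matches])  # ['Irrigation in Grape Vineyards']
```

The filter never opens the files themselves. It works only on the creation date and title, which is the sense in which metadata can matter as much as the data.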


November 6, 2017

Nitin Motgi is Founder and CTO of Cask, where he is responsible for developing the company’s long-term technology, driving company engineering initiatives and collaboration. Prior to Cask, Nitin was at Yahoo! working on a large-scale content optimization system externally known as C.O.R.E.

It is no secret that traditional platforms for data analysis, like data warehouses, are difficult and expensive to scale to meet current demands for storage and compute. Purpose-built platforms designed to process big data often require significant up-front and ongoing investment if deployed on-premises. Cloud computing, by contrast, can scale to accommodate such large volumes of data economically. While the economics are right, enterprises migrating their on-premises data warehouses or building a new warehouse or data lake in the cloud face many challenges along the way. …


October 24, 2017

Sreevatsan Raman is the Head of Engineering at Cask, where he is driving the company’s engineering initiatives. Prior to Cask, Sree designed and implemented big data infrastructure at Klout and Yahoo!

Over the last few years, the popularity of cloud-based software development has risen dramatically, along with the need to share development assets and resources within and across organizations. Containers and open source have simplified the sharing and cloning of code and entire dev/test environments, taking the efficiency, collaboration and productivity of product engineering organizations to new levels. In this blog we take an internal view of some of the challenges software engineers at Cask have to tackle, and how cloud-based self-service tools have enabled our team to be more productive using Google Cloud Platform.

Self-service cluster provisioning for developers

Engineers at Cask develop…


September 13, 2017

Nitin Motgi is Founder and CTO of Cask, where he is responsible for developing the company’s long-term technology, driving company engineering initiatives and collaboration. Prior to Cask, Nitin was at Yahoo! working on a large-scale content optimization system externally known as C.O.R.E.

I am always puzzled that people think that “Big Data” is only about archiving massive amounts of data. The disruption in the market was not because companies could archive large amounts of data, but because the business value was realized by insights generated through analytics on the data aggregated. Proven by early adopters, the potential to create new business models opened doors to new business opportunities, improving customer experience and much more, and led to the success of “Big Data”. These initial pockets of success led to creating a new market, a new industry, and with those…


September 7, 2017

Bhooshan Mogal is a Software Engineer at Cask, where he is working on making data application development fun and simple. Before Cask, he worked on a unified storage abstraction for Hadoop at Pivotal and personalization systems at Yahoo.

Data Engineering groups in large enterprises are typically decentralized. Teams develop specialized skill sets in particular areas of data processing, and have specific charters. For example, a team may be responsible for data acquisition. Another may be responsible for cleansing, transforming, normalizing and analyzing data. Another team of data scientists may be responsible for consuming this data, and applying machine learning models to derive insights from data. This results in the creation of complex data processing dependencies in a large enterprise. Typically, these dependencies are events generated by a given process that another process may depend on…


August 30, 2017

Nitin Motgi is Founder and CTO of Cask, where he is responsible for developing the company’s long-term technology, driving company engineering initiatives and collaboration. Prior to Cask, Nitin was at Yahoo! working on a large-scale content optimization system externally known as C.O.R.E.

We would like to thank all our users and customers for the great conversations we have had around use cases, the challenges you face with operationalizing a data lake and/or building data analytics solutions, and your candid feedback on CDAP usability. These interactions are invaluable and we always love hearing from you. You have offered a lot of insights to our product team on how to make CDAP even better.

In this blog, we will describe the enhancements we made in the latest release of CDAP, after internalizing your feedback. …


August 23, 2017

Sreevatsan Raman is the Head of Engineering at Cask, where he is driving the company’s engineering initiatives. Prior to Cask, Sree designed and implemented big data infrastructure at Klout and Yahoo!

Enterprise challenges

Hadoop has emerged as the leading technology to solve a number of big data use cases.

However, enterprises needing to solve their business problems often need to piece together different technologies to build a solution. Each component in the Hadoop technology stack is infrastructure focused and purpose-built to solve a unique set of problems. An enterprise that wants to solve a business use case — for example, a managed data lake — will need to spend a lot of time integrating these technologies to build the solution they need. Enterprises are also challenged with the talent gap…


July 18, 2017

Sreevatsan Raman is the Head of Engineering at Cask, where he is driving the company’s engineering initiatives. Prior to Cask, Sree designed and implemented big data infrastructure at Klout and Yahoo!

Cask Data Application Platform (CDAP) is a platform-agnostic unified integration platform that allows users to run, manage and deploy big data applications independent of distros on-premises, in the cloud or in a hybrid environment. We recently announced the availability of Cloud Sandbox for AWS, and in order to continue to give our customers and users more options to try and experience CDAP, we are very happy to announce the availability of CDAP Cloud Sandbox on Microsoft Azure!

The Cloud Sandbox gives CDAP users more choices to experience CDAP. The Cloud Sandbox is a fully configured and functional…


June 29, 2017

Derek Wood is a DevOps Engineer at Cask, where he is building tools to manage and operate the next generation of Big Data applications. Prior to Cask, Derek ran large-scale distributed systems at Wells Fargo and at Yahoo!, where he was the senior engineering lead for the CORE content personalization platform.

Today, we are announcing the availability of CDAP Cloud Sandbox on AWS. The Cloud Sandbox gives CDAP users more choices to experience CDAP. Cloud Sandbox is a fully configured and functional version of Cask’s flagship offering, but scale-limited to a single-node instance. Developers can provision an instance of Cloud Sandbox in any of the AWS regions with the click of a button and experience the power of CDAP without having to set up or configure Hadoop clusters. Cloud Sandbox simplifies evaluation and testing, thereby providing a quicker way to get productive with CDAP.

CDAP Cloud Sandbox…


June 7, 2017

Sreevatsan Raman is the Head of Engineering at Cask, where he is driving the company’s engineering initiatives. Prior to Cask, Sree designed and implemented big data infrastructure at Klout and Yahoo!

I am very happy to announce the general availability of Cask Data Application Platform 4.2. Over the last few months, we have been focusing on enhancing the user experience and usability of the product. CDAP 4.2 comes with several features that offer new users of CDAP a great first-five-minutes experience: enhanced self-service Data Preparation capabilities, an improved interactive Apache Spark experience in CDAP Data pipelines, and Change Data Capture (CDC) from SQL Server and Oracle. In addition, we have several new platform enhancements that enable event-driven data schedules, broaden distro…

cdapio

A 100% open source framework for building data analytics applications.
