A to Z of Google Cloud Platform, a personal selection — Y is for YARN
I was absolutely defeated by this letter. I’ve been asking for a couple of weeks for suggestions and the same subjects kept being thrown my way. So I just want to thank @complex and @sveesible for both suggesting YARN and for keeping me on track to meet my stated goal of finishing this series in June while keeping mostly everything consecutive, letter-wise.
YARN — this stands for Yet Another Resource Negotiator. It was introduced as part of the Hadoop v2 architecture.
YARN decouples MapReduce’s resource management and scheduling capabilities from the data processing component. Why is this a thing? Well, before YARN the Hadoop JobTracker limited scalability and was also a single point of failure.
So from the tightly coupled configuration found in Hadoop v1, the introduction of YARN changed the model as illustrated in the diagram below (as I’m talking about GCP in this series of posts, GCS is shown in the storage part of Hadoop v2 in the diagram):
The introduction of YARN allowed Hadoop to support more varied processing approaches and a broader array of applications, so you can now run interactive querying and streaming data applications simultaneously with MapReduce batch jobs. Some of these other tools are illustrated in the diagram above.
The diagram below, from the Apache YARN pages, illustrates the architecture.
I’m an old skool Hadoop/MapReduce user, so YARN isn’t something that was around when I was initially getting my hands dirty with Hadoop, but I wish it had been!
I won’t spend time talking about the history of Hadoop v1 and Hadoop v2; there are some excellent posts out on the interwebs that go into that. This is a series about GCP, so I’ll focus on where YARN fits into the GCP ecosystem now.
GCP has a fully managed service for running the Apache Spark and Apache Hadoop ecosystem called Cloud Dataproc. It provisions clusters, big or small, rapidly and supports many popular job types. YARN is integrated with this service, and when you start a Dataproc cluster from the Cloud console this integration is obvious.
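You can also create a cluster from the command line. Here’s a minimal sketch using the gcloud CLI — the cluster name, region and sizing below are all illustrative placeholders, not values from this post:

```shell
# Create a small Dataproc cluster (name, region and sizes are illustrative)
gcloud dataproc clusters create my-yarn-demo \
    --region=us-central1 \
    --num-workers=2 \
    --worker-machine-type=n1-standard-4

# Confirm the cluster is up
gcloud dataproc clusters list --region=us-central1
```

Dataproc typically brings clusters up in well under two minutes, which is part of the appeal over self-managed Hadoop.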
You can interact with a Dataproc cluster in a number of ways, including through the YARN web interface. The interface can be found on your Dataproc cluster’s master node on port 8088.
To access this interface, it is highly recommended that you use an SSH tunnel to create a secure connection to the master node. The SSH tunnel supports traffic proxying using the SOCKS protocol, which means you can send network requests through the tunnel from any browser that supports SOCKS, transferring all of your browser data over SSH and eliminating the need to open firewall ports to access the web interfaces. See the docs for how to do this.
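As a sketch, assuming a cluster whose master node is named my-yarn-demo-m in zone us-central1-b (both placeholders), the tunnel and browser setup look something like this:

```shell
# Open a SOCKS proxy on local port 1080 through the master node
# (cluster and zone names are illustrative)
gcloud compute ssh my-yarn-demo-m \
    --zone=us-central1-b \
    -- -D 1080 -N

# In a second terminal, launch a browser that routes all traffic
# through the proxy, then browse to http://my-yarn-demo-m:8088
/usr/bin/google-chrome \
    --proxy-server="socks5://localhost:1080" \
    --user-data-dir=/tmp/my-yarn-demo-m
```

The separate --user-data-dir gives you a fresh browser profile, so your normal browsing session isn’t pushed through the tunnel.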
Through this interface you can also monitor your cluster, in addition to using the Cloud console or the command line. This integration means you can continue to use the tools you are already used to.
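From the command line, a couple of gcloud commands cover the basics of cluster monitoring (cluster name and region are again illustrative):

```shell
# Inspect the cluster's status and configuration
gcloud dataproc clusters describe my-yarn-demo --region=us-central1

# List jobs that have been submitted to the cluster
gcloud dataproc jobs list --region=us-central1 --cluster=my-yarn-demo
```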
You can adjust the YARN parameters in yarn-site.xml.
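On Dataproc you don’t have to edit the file by hand: cluster properties passed with the yarn: prefix are written into yarn-site.xml when the cluster is created. A sketch, where the cluster name and the memory value are illustrative:

```shell
# Set a yarn-site.xml property at cluster creation time using the
# "yarn:" properties prefix (cluster name and value are illustrative)
gcloud dataproc clusters create my-yarn-demo \
    --region=us-central1 \
    --properties='yarn:yarn.nodemanager.resource.memory-mb=12288'
```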
As well as system logs, application logs like the YARN logs are also forwarded to Cloud Logging (see my entry for L in this series for a bit about Logging).