Monitoring Apache Spark in Hadoop environments
Apache Spark, together with Apache Hadoop, is commonly used for Big Data processing. At the same time, such processing is difficult to monitor properly, especially in distributed environments where a single application runs on multiple worker nodes.
We are the authors of an extension that collects the most important metrics exposed by the Apache Spark API and correlates them with measurements taken by APM tools running on the hosts. This gives us a live view of which application and which executor ran on a given worker node. We know how many resources have been allocated and whether a stage or job ended in an error. With this complete set of information, we can easily tell which processing jobs our environment is struggling with, and therefore where we should start optimizing.
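To illustrate the kind of data this correlation relies on, below is a minimal sketch of a Spark listener that captures executor-to-host mappings, allocated cores, and stage/job outcomes via Spark's public scheduler listener API. This is not the implementation of our extension; the `MonitoringListener` class name and the `report` sink are hypothetical placeholders for forwarding data to an APM agent.

```scala
import org.apache.spark.scheduler._

// Sketch: collect which executor runs on which host, how many cores it was
// given, and whether stages and jobs finished successfully or with an error.
class MonitoringListener extends SparkListener {

  // Hypothetical sink; a real extension would push these events to the APM tool.
  private def report(line: String): Unit = println(s"[monitoring] $line")

  // Executor allocation: executor id, host it landed on, cores it received.
  override def onExecutorAdded(e: SparkListenerExecutorAdded): Unit =
    report(s"executor ${e.executorId} started on host ${e.executorInfo.executorHost} " +
           s"with ${e.executorInfo.totalCores} cores")

  // Stage outcome, including the failure reason if the stage ended in an error.
  override def onStageCompleted(s: SparkListenerStageCompleted): Unit = {
    val status = s.stageInfo.failureReason.map(r => s"FAILED: $r").getOrElse("succeeded")
    report(s"stage ${s.stageInfo.stageId} (${s.stageInfo.name}) $status")
  }

  // Job outcome: JobSucceeded is the only public success result.
  override def onJobEnd(j: SparkListenerJobEnd): Unit = j.jobResult match {
    case JobSucceeded => report(s"job ${j.jobId} succeeded")
    case _            => report(s"job ${j.jobId} failed")
  }
}
```

Such a listener can be attached either programmatically with `sparkContext.addSparkListener(new MonitoringListener)` or, without touching application code, through the `spark.extraListeners` configuration property. Correlating these events with host-level APM measurements is what makes it possible to pinpoint which worker, executor, and application were involved when a stage or job misbehaved.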