Investigating Resource Usage and Interaction Between Simultaneously Executing Deep Neural Network Frameworks Using Mesos

Alexander and Michael Roberts

EECS 582 Fall 2016

Problem Statement
We are unsure how deep learning frameworks perform when run individually versus simultaneously in a distributed environment. With n applications running on m machines, we expect cluster managers like Mesos to provide resource isolation, allowing each application to perform as though it were allocated m/n machines' worth of resources. Does measured performance meet this expectation? Why or why not? Will one application dominate the other in a distributed environment? Why or why not?
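The fair-share expectation above can be made concrete with a small sketch (our own illustrative helpers, not part of any framework): under perfect isolation, each of n applications receives m/n machine-equivalents, so a workload that takes time T alone on all m machines should take roughly n*T when sharing.

```python
# Illustrative sketch of the fair-share expectation (assumed model, not a
# Mesos API): n applications sharing m machines.

def fair_share(n_apps, m_machines):
    """Machine-equivalents each application should receive under perfect isolation."""
    return m_machines / n_apps

def isolation_ratio(runtime_alone, runtime_shared, n_apps):
    """Measured slowdown relative to the ideal n-fold slowdown.

    1.0 means the cluster manager isolated resources as expected;
    values above 1.0 indicate interference beyond the fair-share prediction.
    """
    expected = runtime_alone * n_apps  # ideal runtime when sharing with n apps
    return runtime_shared / expected
```

An `isolation_ratio` of, say, 1.3 would mean a shared run took 30% longer than the fair-share model predicts, which is exactly the kind of gap our experiments aim to detect and explain.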

Machine learning is becoming an increasingly important tool in business, allowing companies to harness user data to deliver more individually tailored experiences to customers. Given the abundance of data sources available today, the ability to perform efficient, rapid computation on large data sets in a distributed environment is crucial for a business to make use of this data. We therefore hope to discover and investigate any issues that arise from the interaction of multiple simultaneously executing frameworks.

We plan to split our work into four phases: (1) Install the TensorFlow and Caffe deep learning frameworks on a single node and develop sample, equivalent workloads for them. This will let us understand the uses and intricacies of each framework and establish a baseline against which they can be directly compared. (2) Obtain or simulate a cluster of machines, install a cluster manager such as Apache Mesos, and configure the deep learning frameworks to run on the cluster. This will allow us to run our sample workloads individually, gaining insight into the process and collecting performance metrics for each independent run. (3) Run the frameworks simultaneously on the cluster and investigate the performance of each to determine how well the system isolates and shares resources. (4) Attempt to discover how any issues we observe may be mitigated or designed around.
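For phase (1), "equivalent" workloads only yield a fair comparison if every training knob matches across frameworks. A minimal sketch of how we might pin those knobs (the field names and values below are our own hypothetical choices; neither TensorFlow nor Caffe reads this format):

```python
# Hypothetical workload specification used to keep the TensorFlow and Caffe
# runs comparable. All field names/values are illustrative assumptions.
WORKLOAD = {
    "model": "lenet-5",       # same network architecture in both frameworks
    "dataset": "mnist",       # same input data
    "batch_size": 64,
    "epochs": 10,
    "optimizer": "sgd",
    "learning_rate": 0.01,
}

def comparable(spec_a, spec_b):
    """Two framework configurations are directly comparable only if every knob agrees."""
    return spec_a == spec_b
```

Checking `comparable` on both frameworks' configurations before a run would guard against accidentally measuring a hyperparameter difference instead of a framework or scheduler difference.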

Eval Plan
We plan to profile the machines' disk, memory, CPU, network, and GPU usage (if our infrastructure provides GPU support) during execution. Profiling will let us identify which resource bottlenecks each process, whether that bottleneck changes under distributed or simultaneous execution, and whether it matches our expectations. Overall, we hope to measure speed and resource usage to quantify the performance degradation that simultaneous execution causes. We also hope to discover whether one application “dominates” the other in terms of resource allocation and why this may or may not happen, although our method of tracking this is currently less concrete.
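One possible way to make the "dominance" question concrete, sketched below as our own heuristic (not an established metric): sample each application's share of a resource at fixed intervals, then call one application dominant if its mean share exceeds the other's by some factor.

```python
# Illustrative dominance heuristic over sampled resource-usage fractions.
# The threshold factor and the sampling scheme are assumptions of ours.

def mean(samples):
    """Arithmetic mean of a non-empty list of usage samples."""
    return sum(samples) / len(samples)

def dominates(samples_a, samples_b, factor=2.0):
    """True if application A's average usage exceeds B's by `factor`.

    `samples_a`/`samples_b` are per-interval usage fractions in [0, 1],
    e.g. CPU share measured once per second during a shared run.
    """
    return mean(samples_a) >= factor * mean(samples_b)
```

Applied per resource (CPU, memory, network, disk), this would let us report not just *that* one framework dominates, but on *which* resource the imbalance appears.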