
Sharing is Caring – Fractional GPU Allocations With MetaGPU Device Plugin - Dmitry Kartsev, cnvrg.io

We all know that once we have a GPU device in our Kubernetes cluster, we strive to utilize it as much as we can. However, Kubernetes doesn't allow us to share a single GPU between workloads, which in many cases leads to underutilized GPUs and wasted resources. This becomes crucial for AI/ML workloads. Some of the latest GPU generations provide sharing capabilities of their own (for example NVIDIA MIG), but older generations have no such option. In addition, enabling sharing and later changing how a GPU is shared can be a non-trivial task for MLOps and data engineers in production environments. To address these problems, my team and I released an open source project called MetaGPU. The MetaGPU project includes a Kubernetes device plugin, a metrics exporter, and CLI tools, which together allow you to dynamically configure GPU sharing with zero downtime, split a single GPU device into a varying number of shares, enforce GPU memory usage, and more. In this session I will share my experience, tips, challenges, and lessons learned from developing and operating production-grade fractional GPUs on a Kubernetes cluster.
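
To make the idea of splitting a GPU into shares and enforcing memory per share more concrete, here is a minimal sketch. It is not MetaGPU's implementation: the class `GpuShareConfig`, the function `exceeds_limit`, and the rule that memory is divided in proportion to the number of requested shares are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuShareConfig:
    """Hypothetical description of how one physical GPU is split (not a MetaGPU type)."""

    total_memory_mib: int  # memory of the physical device, in MiB
    shares_per_gpu: int    # number of equal shares the device is split into

    def memory_limit_mib(self, requested_shares: int) -> int:
        """Memory budget for a workload, assumed proportional to its share count."""
        if not 1 <= requested_shares <= self.shares_per_gpu:
            raise ValueError(
                f"requested_shares must be between 1 and {self.shares_per_gpu}"
            )
        return self.total_memory_mib * requested_shares // self.shares_per_gpu


def exceeds_limit(config: GpuShareConfig, requested_shares: int, used_mib: int) -> bool:
    """Return True when a workload uses more GPU memory than its shares allow."""
    return used_mib > config.memory_limit_mib(requested_shares)


if __name__ == "__main__":
    # Assumed example: a 16 GiB GPU split into 4 shares.
    cfg = GpuShareConfig(total_memory_mib=16384, shares_per_gpu=4)
    print(cfg.memory_limit_mib(1))
    print(exceeds_limit(cfg, requested_shares=1, used_mib=5000))
```

Under these assumptions, changing `shares_per_gpu` changes every workload's budget without touching the workloads themselves, which is the kind of reconfiguration the abstract describes at a high level.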


