GPU Programming with CUDA Training
CUDA (Compute Unified Device Architecture) is NVIDIA's proprietary parallel computing platform and programming model. It enables code to run on NVIDIA GPUs, which are widely used in high-performance computing, artificial intelligence (AI), gaming, and graphics. CUDA exposes hardware details to the programmer and gives full control over the parallelization process; however, this also requires a solid understanding of the device architecture, memory model, execution model, and optimization techniques.
This instructor-led, live training (online or onsite) is aimed at beginner-level to intermediate-level developers who wish to use CUDA to program NVIDIA GPUs and exploit their parallelism.
By the end of this training, participants will be able to:
- Set up a development environment that includes the CUDA Toolkit, an NVIDIA GPU, and Visual Studio Code.
- Create a basic CUDA program that performs vector addition on the GPU and retrieves the results from GPU memory.
- Use the CUDA API to query device information, allocate and deallocate device memory, copy data between host and device, launch kernels, and synchronize threads.
- Use the CUDA C/C++ language to write kernels that execute on the GPU and manipulate data.
- Use CUDA built-in functions, variables, and libraries to perform common tasks and operations.
- Use CUDA memory spaces, such as global, shared, constant, and local, to optimize data transfers and memory accesses.
- Use the CUDA execution model to control the threads, blocks, and grids that define parallelism.
- Debug and test CUDA programs using tools such as CUDA-GDB, CUDA-MEMCHECK, and NVIDIA Nsight.
- Optimize CUDA programs using techniques such as coalescing, caching, prefetching, and profiling.
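The vector-addition objective above can be sketched as a minimal complete CUDA program (an illustrative sketch, assuming a CUDA-capable GPU and an installed CUDA Toolkit; compile with `nvcc vecadd.cu -o vecadd`):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Kernel: each thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Allocate device memory and copy inputs host -> device.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;   // round up to cover all elements
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // Retrieve the result; cudaMemcpy implicitly synchronizes with the kernel.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```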
Format of the Course
- Interactive lecture and discussion.
- Lots of exercises and practice.
- Hands-on implementation in a live-lab environment.
Course Customization Options
- To request a customized training for this course, please contact us to arrange.
Course Outline
Introduction
- What is CUDA?
- CUDA vs. OpenCL vs. SYCL
- Overview of CUDA features and architecture
- Setting up the development environment
Getting Started
- Creating a new CUDA project with Visual Studio Code
- Exploring the project structure and files
- Compiling and running the program
- Displaying output using printf and fprintf
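A first "hello world" along the lines of this section might look as follows (a sketch; device-side printf requires compute capability 2.0 or later, and its output is flushed when the host synchronizes):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Each thread prints its own coordinates.
__global__ void hello() {
    printf("Hello from thread %d of block %d\n", threadIdx.x, blockIdx.x);
}

int main() {
    hello<<<2, 4>>>();           // launch 2 blocks of 4 threads each
    cudaDeviceSynchronize();     // wait for the kernel and flush printf output
    return 0;
}
```

Compile and run with, for example, `nvcc hello.cu -o hello && ./hello`.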
CUDA API
- Understanding the role of the CUDA API in the host program
- Using the CUDA API to query device information and capabilities
- Using the CUDA API to allocate and deallocate device memory
- Using the CUDA API to copy data between host and device
- Using the CUDA API to launch kernels and synchronize threads
- Using the CUDA API to handle errors and exceptions
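Device queries and error handling from this section can be combined in a short host-only program (a sketch using the CUDA runtime API):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        // Every runtime call returns a cudaError_t that can be decoded.
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s, compute capability %d.%d, %zu MB global memory\n",
               d, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem / (1024 * 1024));
    }
    return 0;
}
```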
CUDA C/C++
- Understanding the role of CUDA C/C++ in the device program
- Using CUDA C/C++ to write kernels that execute on the GPU and manipulate data
- Using CUDA C/C++ data types, qualifiers, operators, and expressions
- Using CUDA C/C++ built-in functions, such as math, atomic, warp, etc.
- Using CUDA C/C++ built-in variables, such as threadIdx, blockIdx, blockDim, etc.
- Using CUDA C/C++ libraries, such as cuBLAS, cuFFT, cuRAND, etc.
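The built-in variables and atomic functions listed above come together in a kernel like the following (an illustrative sketch; `sumAll` is a hypothetical name):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Each thread atomically adds its element into one accumulator;
// atomicAdd serializes conflicting updates to the same address.
__global__ void sumAll(const int *data, int *result, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // built-in variables
    if (i < n) atomicAdd(result, data[i]);
}

int main() {
    const int n = 1024;
    int *d_data, *d_result;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_result, sizeof(int));
    cudaMemset(d_data, 0, n * sizeof(int));
    cudaMemset(d_result, 0, sizeof(int));

    sumAll<<<(n + 255) / 256, 256>>>(d_data, d_result, n);

    int h_result = 0;
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum = %d\n", h_result);
    cudaFree(d_data); cudaFree(d_result);
    return 0;
}
```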
CUDA Memory Model
- Understanding the differences between the host and device memory models
- Using CUDA memory spaces, such as global, shared, constant, and local
- Using CUDA memory objects, such as pointers, arrays, textures, and surfaces
- Using CUDA memory access patterns, such as read-only, write-only, read-write, etc.
- Using the CUDA memory consistency model and synchronization mechanisms
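Shared memory and synchronization, as covered in this section, are typically illustrated with a block-level reduction (a kernel sketch, assuming it is launched with exactly 256 threads per block):

```cuda
// Block-level sum using shared memory: each block reduces its tile in
// fast on-chip memory, then writes one partial sum to global memory.
__global__ void blockSum(const float *in, float *partial, int n) {
    __shared__ float tile[256];               // shared: visible to the whole block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;       // global -> shared
    __syncthreads();                          // all loads complete before reducing
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();                      // each reduction step is a barrier
    }
    if (tid == 0) partial[blockIdx.x] = tile[0];  // one result per block
}
```

The host would then reduce the per-block partial sums, either with a second kernel launch or on the CPU.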
CUDA Execution Model
- Understanding the differences between the host and device execution models
- Using CUDA threads, blocks, and grids to define the degree of parallelism
- Using CUDA thread variables, such as threadIdx, blockIdx, blockDim, etc.
- Using CUDA block functions, such as __syncthreads, __threadfence_block, etc.
- Using CUDA grid features, such as gridDim, grid-wide synchronization, cooperative groups, etc.
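Choosing a grid and block shape is the core of the execution model; a common 2D pattern maps one thread per matrix element (a sketch; `scale2D` and `launchScale` are illustrative names):

```cuda
#include <cuda_runtime.h>

// 2D grid/block indexing: map each thread to one matrix element.
__global__ void scale2D(float *m, int rows, int cols, float factor) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols)           // guard against over-provisioned threads
        m[row * cols + col] *= factor;
}

// Host-side launch: round grid dimensions up so every element is covered.
void launchScale(float *d_m, int rows, int cols, float factor) {
    dim3 block(16, 16);                     // 256 threads per block
    dim3 grid((cols + block.x - 1) / block.x,
              (rows + block.y - 1) / block.y);
    scale2D<<<grid, block>>>(d_m, rows, cols, factor);
}
```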
Debugging
- Understanding common errors and bugs in CUDA programs
- Using the Visual Studio Code debugger to inspect variables, set breakpoints, examine the call stack, etc.
- Using CUDA-GDB to debug CUDA programs on Linux
- Using CUDA-MEMCHECK to detect memory errors and leaks
- Using NVIDIA Nsight to debug and profile CUDA programs on Windows
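Before reaching for the tools above, most CUDA code wraps every runtime call in an error-checking macro; kernel launches return no status directly, so errors must be polled afterwards (a common sketch, not a specific tool's API):

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Wrap every runtime call so failures report file and line.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err__ = (call);                                     \
        if (err__ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
                    cudaGetErrorString(err__), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

int main() {
    float *d;
    CUDA_CHECK(cudaMalloc(&d, 1024 * sizeof(float)));
    // myKernel<<<grid, block>>>(...);       // a launch returns no status
    CUDA_CHECK(cudaGetLastError());          // catches launch-configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());     // catches errors during execution
    CUDA_CHECK(cudaFree(d));
    return 0;
}
```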
Optimization
- Understanding the factors that affect the performance of CUDA programs
- Using coalescing techniques to improve memory throughput
- Using caching techniques to reduce memory latency
- Using shared-memory and local-memory techniques to optimize memory accesses and bandwidth
- Using profiling and analysis tools to measure and improve execution time and resource utilization
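Coalescing, the first optimization listed, is easiest to see by contrasting two copy kernels (a sketch; the actual throughput difference depends on the GPU and would be measured with a profiler):

```cuda
// Coalesced: consecutive threads read consecutive addresses, so a warp's
// 32 loads combine into a few wide memory transactions.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads touch addresses `stride` elements apart,
// so each load may need its own transaction and throughput drops sharply.
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```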
Summary and Next Steps
Requirements
- An understanding of the C/C++ language and parallel programming concepts
- Basic knowledge of computer architecture and the memory hierarchy
- Experience with command-line tools and code editors
Audience
- Developers who wish to learn how to use CUDA to program NVIDIA GPUs and exploit their parallelism
- Developers who wish to write high-performance, scalable code that can run on different CUDA devices
- Programmers who wish to explore the low-level aspects of GPU programming and optimize their code's performance
Open Training Courses require 5+ participants.
GPU Programming with CUDA Training - Booking
GPU Programming with CUDA Training - Enquiry
GPU Programming with CUDA - Consultancy Enquiry
Provisional Upcoming Courses (Require 5+ participants)
Related Courses
Developing AI Applications with Huawei Ascend and CANN
21 hours: Huawei Ascend is a family of AI processors designed for high-performance inference and training.
This instructor-led, live training (online or onsite) is aimed at intermediate-level AI engineers and data scientists who wish to develop and optimize neural network models using Huawei’s Ascend platform and the CANN toolkit.
By the end of this training, participants will be able to:
- Set up and configure the CANN development environment.
- Develop AI applications using MindSpore and CloudMatrix workflows.
- Optimize performance on Ascend NPUs using custom operators and tiling.
- Deploy models to edge or cloud environments.
Format of the Course
- Interactive lecture and discussion.
- Hands-on use of Huawei Ascend and CANN toolkit in sample applications.
- Guided exercises focused on model building, training, and deployment.
Course Customization Options
- To request a customized training for this course based on your infrastructure or datasets, please contact us to arrange.
Deploying AI Models with CANN and Ascend AI Processors
14 hours: CANN (Compute Architecture for Neural Networks) is Huawei’s AI compute stack for deploying and optimizing AI models on Ascend AI processors.
This instructor-led, live training (online or onsite) is aimed at intermediate-level AI developers and engineers who wish to deploy trained AI models efficiently to Huawei Ascend hardware using the CANN toolkit and tools such as MindSpore, TensorFlow, or PyTorch.
By the end of this training, participants will be able to:
- Understand the CANN architecture and its role in the AI deployment pipeline.
- Convert and adapt models from popular frameworks to Ascend-compatible formats.
- Use tools like ATC, OM model conversion, and MindSpore for edge and cloud inference.
- Diagnose deployment issues and optimize performance on Ascend hardware.
Format of the Course
- Interactive lecture and demonstration.
- Hands-on lab work using CANN tools and Ascend simulators or devices.
- Practical deployment scenarios based on real-world AI models.
Course Customization Options
- To request a customized training for this course, please contact us to arrange.
GPU Programming on Biren AI Accelerators
21 hours: Biren AI Accelerators are high-performance GPUs designed for AI and HPC workloads with support for large-scale training and inference.
This instructor-led, live training (online or onsite) is aimed at intermediate-level to advanced-level developers who wish to program and optimize applications using Biren’s proprietary GPU stack, with practical comparisons to CUDA-based environments.
By the end of this training, participants will be able to:
- Understand Biren GPU architecture and memory hierarchy.
- Set up the development environment and use Biren’s programming model.
- Translate and optimize CUDA-style code for Biren platforms.
- Apply performance tuning and debugging techniques.
Format of the Course
- Interactive lecture and discussion.
- Hands-on use of Biren SDK in sample GPU workloads.
- Guided exercises focused on porting and performance tuning.
Course Customization Options
- To request a customized training for this course based on your application stack or integration needs, please contact us to arrange.
Cambricon MLU Development with BANGPy and Neuware
21 hours: Cambricon MLUs (Machine Learning Units) are AI chips optimized for inference and training in edge and data center scenarios.
This instructor-led, live training (online or onsite) is aimed at intermediate-level developers who wish to build and deploy AI models on Cambricon MLU hardware using the BANGPy framework and the Neuware SDK.
By the end of this training, participants will be able to:
- Set up and configure the BANGPy and Neuware development environments.
- Develop and optimize Python- and C++-based models for Cambricon MLUs.
- Deploy models to edge and data center devices running the Neuware runtime.
- Integrate ML workflows with MLU-specific acceleration features.
Format of the Course
- Interactive lecture and discussion.
- Hands-on use of BANGPy and Neuware for development and deployment.
- Guided exercises focused on optimization, integration, and testing.
Course Customization Options
- To request a customized training for this course based on your Cambricon device model or use case, please contact us to arrange.
Introduction to CANN for AI Framework Developers
7 hours: CANN (Compute Architecture for Neural Networks) is Huawei’s AI computing toolkit used to compile, optimize, and deploy AI models on Ascend AI processors.
This instructor-led, live training (online or onsite) is aimed at beginner-level AI developers who wish to understand how CANN fits into the model lifecycle from training to deployment, and how it works with frameworks like MindSpore, TensorFlow, and PyTorch.
By the end of this training, participants will be able to:
- Understand the purpose and architecture of the CANN toolkit.
- Set up a development environment with CANN and MindSpore.
- Convert and deploy a simple AI model to Ascend hardware.
- Gain foundational knowledge for future CANN optimization or integration projects.
Format of the Course
- Interactive lecture and discussion.
- Hands-on labs with simple model deployment.
- Step-by-step walkthrough of the CANN toolchain and integration points.
Course Customization Options
- To request a customized training for this course, please contact us to arrange.
CANN for Edge AI Deployment
14 hours: Huawei's Ascend CANN toolkit enables powerful AI inference on edge devices such as the Ascend 310. CANN provides essential tools for compiling, optimizing, and deploying models where compute and memory are constrained.
This instructor-led, live training (online or onsite) is aimed at intermediate-level AI developers and integrators who wish to deploy and optimize models on Ascend edge devices using the CANN toolchain.
By the end of this training, participants will be able to:
- Prepare and convert AI models for Ascend 310 using CANN tools.
- Build lightweight inference pipelines using MindSpore Lite and AscendCL.
- Optimize model performance for limited compute and memory environments.
- Deploy and monitor AI applications in real-world edge use cases.
Format of the Course
- Interactive lecture and demonstration.
- Hands-on lab work with edge-specific models and scenarios.
- Live deployment examples on virtual or physical edge hardware.
Course Customization Options
- To request a customized training for this course, please contact us to arrange.
Understanding Huawei’s AI Compute Stack: From CANN to MindSpore
14 hours: Huawei’s AI stack — from the low-level CANN SDK to the high-level MindSpore framework — offers a tightly integrated AI development and deployment environment optimized for Ascend hardware.
This instructor-led, live training (online or onsite) is aimed at beginner-level to intermediate-level technical professionals who wish to understand how the CANN and MindSpore components work together to support AI lifecycle management and infrastructure decisions.
By the end of this training, participants will be able to:
- Understand the layered architecture of Huawei’s AI compute stack.
- Identify how CANN supports model optimization and hardware-level deployment.
- Evaluate the MindSpore framework and toolchain in relation to industry alternatives.
- Position Huawei's AI stack within enterprise or cloud/on-prem environments.
Format of the Course
- Interactive lecture and discussion.
- Live system demos and case-based walkthroughs.
- Optional guided labs on model flow from MindSpore to CANN.
Course Customization Options
- To request a customized training for this course, please contact us to arrange.
Optimizing Neural Network Performance with CANN SDK
14 hours: CANN SDK (Compute Architecture for Neural Networks) is Huawei’s AI compute foundation that allows developers to fine-tune and optimize the performance of deployed neural networks on Ascend AI processors.
This instructor-led, live training (online or onsite) is aimed at advanced-level AI developers and system engineers who wish to optimize inference performance using CANN’s advanced toolset, including the Graph Engine, TIK, and custom operator development.
By the end of this training, participants will be able to:
- Understand CANN's runtime architecture and performance lifecycle.
- Use profiling tools and Graph Engine for performance analysis and optimization.
- Create and optimize custom operators using TIK and TVM.
- Resolve memory bottlenecks and improve model throughput.
Format of the Course
- Interactive lecture and discussion.
- Hands-on labs with real-time profiling and operator tuning.
- Optimization exercises using edge-case deployment examples.
Course Customization Options
- To request a customized training for this course, please contact us to arrange.
CANN SDK for Computer Vision and NLP Pipelines
14 hours: The CANN SDK (Compute Architecture for Neural Networks) provides powerful deployment and optimization tools for real-time AI applications in computer vision and NLP, especially on Huawei Ascend hardware.
This instructor-led, live training (online or onsite) is aimed at intermediate-level AI practitioners who wish to build, deploy, and optimize vision and language models using the CANN SDK for production use cases.
By the end of this training, participants will be able to:
- Deploy and optimize CV and NLP models using CANN and AscendCL.
- Use CANN tools to convert models and integrate them into live pipelines.
- Optimize inference performance for tasks like detection, classification, and sentiment analysis.
- Build real-time CV/NLP pipelines for edge or cloud-based deployment scenarios.
Format of the Course
- Interactive lecture and demonstration.
- Hands-on lab with model deployment and performance profiling.
- Live pipeline design using real CV and NLP use cases.
Course Customization Options
- To request a customized training for this course, please contact us to arrange.
Building Custom AI Operators with CANN TIK and TVM
14 hours: CANN TIK (Tensor Instruction Kernel) and Apache TVM enable advanced optimization and customization of AI model operators for Huawei Ascend hardware.
This instructor-led, live training (online or onsite) is aimed at advanced-level system developers who wish to build, deploy, and tune custom operators for AI models using CANN’s TIK programming model and TVM compiler integration.
By the end of this training, participants will be able to:
- Write and test custom AI operators using the TIK DSL for Ascend processors.
- Integrate custom ops into the CANN runtime and execution graph.
- Use TVM for operator scheduling, auto-tuning, and benchmarking.
- Debug and optimize instruction-level performance for custom computation patterns.
Format of the Course
- Interactive lecture and demonstration.
- Hands-on coding of operators using TIK and TVM pipelines.
- Testing and tuning on Ascend hardware or simulators.
Course Customization Options
- To request a customized training for this course, please contact us to arrange.
Migrating CUDA Applications to Chinese GPU Architectures
21 hours: Chinese GPU architectures such as Huawei Ascend, Biren, and Cambricon MLU offer CUDA alternatives tailored to local AI and HPC markets.
This instructor-led, live training (online or onsite) is aimed at advanced-level GPU programmers and infrastructure specialists who wish to migrate and optimize existing CUDA applications for deployment on Chinese hardware platforms.
By the end of this training, participants will be able to:
- Evaluate the compatibility of existing CUDA workloads with Chinese chip alternatives.
- Port CUDA codebases to Huawei CANN, Biren SDK, and Cambricon BANGPy environments.
- Compare performance and identify optimization points across platforms.
- Address practical challenges in cross-architecture support and deployment.
Format of the Course
- Interactive lecture and discussion.
- Hands-on code translation and performance comparison labs.
- Guided exercises focused on multi-GPU adaptation strategies.
Course Customization Options
- To request a customized training based on your platform or CUDA project, please contact us to arrange.
Performance Optimization on Ascend, Biren, and Cambricon
21 hours: Ascend, Biren, and Cambricon are leading AI hardware platforms in China, each offering unique acceleration and profiling tools for production-scale AI workloads.
This instructor-led, live training (online or onsite) is aimed at advanced-level AI infrastructure and performance engineers who wish to optimize model inference and training workflows across multiple Chinese AI chip platforms.
By the end of this training, participants will be able to:
- Benchmark models on Ascend, Biren, and Cambricon platforms.
- Identify system bottlenecks and memory/compute inefficiencies.
- Apply graph-level, kernel-level, and operator-level optimizations.
- Tune deployment pipelines to improve throughput and reduce latency.
Format of the Course
- Interactive lecture and discussion.
- Hands-on use of profiling and optimization tools on each platform.
- Guided exercises focused on real-world tuning scenarios.
Course Customization Options
- To request a customized training for this course based on your performance environment or model types, please contact us to arrange.