{"id":24427,"date":"2023-03-03T18:05:50","date_gmt":"2023-03-03T17:05:50","guid":{"rendered":"https:\/\/blog.mi.hdm-stuttgart.de\/?p=24427"},"modified":"2023-08-06T21:37:49","modified_gmt":"2023-08-06T19:37:49","slug":"ai-and-scaling-the-compute-for-the-new-moores-law","status":"publish","type":"post","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2023\/03\/03\/ai-and-scaling-the-compute-for-the-new-moores-law\/","title":{"rendered":"AI and Scaling the Compute for the new Moore\u2019s Law"},"content":{"rendered":"\n<p class=\"has-text-align-justify\">AI and Scaling the Compute becomes more relevant as the strive for larger language models and general purpose AI continues. The future of the trend is unknown as the rate of doubling the compute outpaces Moore&#8217;s Law rate of every two year to a 3.4 month doubling.<\/p>\n\n\n<div class=\"wp-block-aioseo-table-of-contents\"><ul><li><a class=\"aioseo-toc-item\" href=\"#aioseo-introduction\">Introduction<\/a><\/li><li><a class=\"aioseo-toc-item\" href=\"#aioseo-requiring-compute-beyond-moores-law\">Requiring compute beyond Moore&#039;s Law<\/a><ul><li><a class=\"aioseo-toc-item\" href=\"#aioseo-increase-in-ai-computation\">Increase in AI computation<\/a><\/li><li><a class=\"aioseo-toc-item\" href=\"#aioseo-the-new-moores-law\">The new Moore&#039;s Law<\/a><\/li><\/ul><\/li><li><a class=\"aioseo-toc-item\" href=\"#aioseo-scaling-ai-computation\">Scaling AI computation<\/a><ul><li><a class=\"aioseo-toc-item\" href=\"#aioseo-gpt-3-example\">GPT-3 Example<\/a><\/li><li><a class=\"aioseo-toc-item\" href=\"#aioseo-distributed-training-but-how\">Distributed training, but how?<\/a><\/li><li><a class=\"aioseo-toc-item\" href=\"#aioseo-the-trend-continues\">The trend continues<\/a><\/li><\/ul><\/li><li><a class=\"aioseo-toc-item\" href=\"#aioseo-conclusion-and-outlook\">Conclusion and outlook<\/a><\/li><li><a class=\"aioseo-toc-item\" href=\"#aioseo-sources\">Sources<\/a><\/li><\/ul><\/div>\n\n\n<h2 
class=\"wp-block-heading\" id=\"aioseo-introduction\">Introduction<\/h2>\n\n\n\n<p class=\"has-text-align-justify\">AI models have been rapidly growing in complexity and sophistication, requiring increasingly powerful computing resources to train and operate effectively. This trend has led to a surge of interest in scaling compute for AI, with researchers exploring new hardware architectures and distributed computing strategies to push the limits of what is possible. <a href=\"#fig1\">Figure 1<\/a> depicts the scale of compute required to train language models for the last ten years.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-4.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"664\" data-attachment-id=\"24431\" data-permalink=\"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2023\/03\/03\/ai-and-scaling-the-compute-for-the-new-moores-law\/image-4-10\/\" data-orig-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-4.png\" data-orig-size=\"1167,757\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"image-4\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-4-300x195.png\" data-large-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-4-1024x664.png\" src=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-4-1024x664.png\" alt=\"\" class=\"wp-image-24431\" 
srcset=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-4-1024x664.png 1024w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-4-300x195.png 300w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-4-768x498.png 768w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-4.png 1167w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption class=\"wp-element-caption\"><p id=\"fig1\"><strong>Figure 1:<\/strong> Computation used to train notable artificial intelligence systems [<a href=\"#ref1\" title=\"1\">1<\/a>]<\/p><\/figcaption><\/figure>\n\n\n\n<p class=\"has-text-align-justify\">The evolution of AI models has been driven by advances in deep learning, which allows models to learn from vast amounts of data and make predictions or decisions with remarkable accuracy. However, the sheer size and complexity of these models require an unprecedented amount of compute power to train and operate, presenting significant challenges for hardware designers and data center operators. Despite these challenges, progress has been impressive, with new breakthroughs in hardware and software helping to unlock the full potential of AI. Specialized hardware, such as GPUs (Graphics Processing Units) and TPUs (Tensor Processing Units) have emerged as powerful tools for training AI models, while distributed computing architectures are being developed to allow multiple machines to work together seamlessly. As AI models continue to grow in complexity, the need for scalable and efficient compute resources will only continue to grow. 
Researchers and engineers will need to work together to develop new hardware and software solutions that can keep pace with the rapid evolution of AI, unlocking new possibilities for intelligent automation, predictive analytics and other transformative applications.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"aioseo-requiring-compute-beyond-moores-law\">Requiring compute beyond Moore&#8217;s Law<\/h2>\n\n\n\n<p class=\"has-text-align-justify\">As explained in [<a href=\"#ref2\" title=\"2\">2<\/a>], the training of AI systems can be categorized into two distinct Eras: the First Era and the Modern Era. The First Era of compute usage in training AI systems, starting with the perceptron, lasted from the 1950s to the late 2000s and relied on limited computational resources and simple algorithms. In contrast, the Modern Era began around 2012 with the rise of deep learning and the availability of powerful hardware such as GPUs and TPUs, allowing for the training of increasingly complex models with millions or even billions of parameters. 
[<a href=\"#ref3\" title=\"3\">3<\/a>] even suggests three Eras with the current one being the &#8220;Large Scale Era&#8221; starting with AlphaGo around 2016.<\/p>\n\n\n\n<figure class=\"wp-block-image\" id=\"fig\"><a href=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-5.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"747\" data-attachment-id=\"24433\" data-permalink=\"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2023\/03\/03\/ai-and-scaling-the-compute-for-the-new-moores-law\/image-5-9\/\" data-orig-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-5.png\" data-orig-size=\"2160,1575\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"image-5\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-5-300x219.png\" data-large-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-5-1024x747.png\" src=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-5-1024x747.png\" alt=\"\" class=\"wp-image-24433\" srcset=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-5-1024x747.png 1024w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-5-300x219.png 300w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-5-768x560.png 768w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-5-1536x1120.png 1536w, 
https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-5-2048x1493.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption class=\"wp-element-caption\"><p id=\"fig2\"><strong>Figure 2:<\/strong> AlexNet to AlphaGo Zero: 300,000x increase in compute [<a href=\"#ref2\" title=\"2\">2<\/a>]<\/p><\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"aioseo-increase-in-ai-computation\">Increase in AI computation<\/h3>\n\n\n\n<p class=\"has-text-align-justify\"><a href=\"#fig2\" title=\"Figure 2\">Figure 2<\/a> depicts the increase in computational resources needed to train AI systems over time, which is evidenced by the rise of GPUs and TPUs and the transition from Moore&#8217;s Law&#8217;s 2-year doubling of compute to a 3.4-month doubling. This increase in compute demand is exemplified by the difference between AlexNet and AlphaGo Zero, where the latter requires 300,000 times more computational resources to train than the former.<\/p>\n\n\n\n<p class=\"has-text-align-justify\">With the rise of large language models like GPT, lately in the public eye thanks to the freely available ChatGPT, the question arose of how the compute for such models will continue to scale. 
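These growth rates are easy to sanity-check with a little arithmetic. The following sketch (plain Python, using only the 300,000x figure and the doubling times quoted above) compares how long a 300,000x increase in training compute takes under a 3.4-month doubling versus Moore's Law's two-year doubling:

```python
import math

def years_to_grow(factor: float, doubling_months: float) -> float:
    """Years needed for compute to grow by `factor` at the given doubling time."""
    doublings = math.log2(factor)  # number of doublings required
    return doublings * doubling_months / 12

FACTOR = 300_000  # AlexNet -> AlphaGo Zero compute increase reported in [2]

print(f"doublings needed: {math.log2(FACTOR):.1f}")                    # ~18.2
print(f"3.4-month doubling: {years_to_grow(FACTOR, 3.4):.1f} years")   # ~5.2
print(f"24-month doubling (Moore's Law): {years_to_grow(FACTOR, 24):.1f} years")  # ~36.4
```

Roughly 18 doublings at 3.4 months each fit into about five years, consistent with the 2012-2017 window between the two systems, whereas at Moore's Law pace the same growth would have taken well over three decades.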
As seen in <a href=\"#fig3\" title=\"Figure 3\">Figure 3<\/a>, the number of parameters to be learned is increasing rapidly, and with it the amount of data and compute required to train the models.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-7.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"661\" data-attachment-id=\"24441\" data-permalink=\"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2023\/03\/03\/ai-and-scaling-the-compute-for-the-new-moores-law\/image-7-8\/\" data-orig-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-7.png\" data-orig-size=\"1956,1262\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"image-7\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-7-300x194.png\" data-large-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-7-1024x661.png\" src=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-7-1024x661.png\" alt=\"\" class=\"wp-image-24441\" srcset=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-7-1024x661.png 1024w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-7-300x194.png 300w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-7-768x496.png 768w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-7-1536x991.png 1536w, 
https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/03\/image-7.png 1956w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption class=\"wp-element-caption\"><p id=\"fig3\"><strong>Figure 3<\/strong>: Amount of Parameters for Large Language Models [<a href=\"#ref4\" title=\"4\">4<\/a>]<\/p><\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"aioseo-the-new-moores-law\">The new Moore&#8217;s Law<\/h3>\n\n\n\n<p class=\"has-text-align-justify\">Moore&#8217;s Law is the observation that the number of transistors in a dense integrated circuit doubles about every two years [<a href=\"#ref5\" title=\"5\">5<\/a>]. Since Moore&#8217;s Law is bound by a physical limit on how many transistors can be placed on an integrated circuit, it will eventually cease to apply, and a new compute trend seems to be emerging in the field of AI. Given the growth in language model size described in [<a href=\"#ref4\" title=\"4\">4<\/a>] and the 3.4-month doubling time reported in [<a href=\"#ref2\" title=\"2\">2<\/a>], we seem to be establishing a new &#8220;Moore&#8217;s Law for AI&#8221; for compute, one that can only be sustained with massive parallelization techniques.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"aioseo-scaling-ai-computation\">Scaling AI computation<\/h2>\n\n\n\n<p class=\"has-text-align-justify\">An earlier blog post [<a href=\"#ref6\" title=\"6\">6<\/a>] already explained how deep learning models can be parallelized using the different computation paradigms single instance single device (SISD), multi-instance single device (MISD), single-instance multi-device (SIMD) and multi-instance multi-device (MIMD). 
Furthermore, the concepts of model and data parallelism are explained there in more detail.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"aioseo-gpt-3-example\">GPT-3 Example<\/h3>\n\n\n\n<p class=\"has-text-align-justify\">Take GPT-3, for example: it was scaled up in several ways to handle its massive size and complexity. Here are some of the key techniques that were used [<a href=\"#ref7\" title=\"7\">7<\/a>]:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Distributed training:<\/strong> GPT-3 was trained using a distributed training approach that involved multiple GPUs and TPUs working together in parallel. The training data was partitioned across the multiple devices and the model parameters were updated using a process called gradient descent, where each device calculated a portion of the gradient and the portions were then combined to update the parameters.<\/li>\n\n\n\n<li><strong>Model parallelism:<\/strong> Because GPT-3 has so many parameters (up to 175 billion), it was not possible to store the entire model on a single device. Instead, the model was split across multiple devices using a technique called model parallelism, where each device stores a portion of the model parameters and computes a portion of the model&#8217;s output.<\/li>\n\n\n\n<li><strong>Pipeline parallelism:<\/strong> To further scale up training, GPT-3 also used a technique called pipeline parallelism, where the model is divided into multiple stages and each stage is run on a separate set of devices in parallel. This enables the model to handle much larger batch sizes and process more data in less time.<\/li>\n\n\n\n<li><strong>Mixed precision training:<\/strong> GPT-3 used mixed precision training, which involves using both 16-bit and 32-bit floating-point numbers to represent the model parameters and compute gradients. 
This can significantly speed up training and reduce the memory requirements of the model.<\/li>\n\n\n\n<li><strong>Adaptive optimization:<\/strong> Finally, GPT-3 used an adaptive optimization algorithm called AdamW that adjusts the learning rate and weight decay of the model dynamically during training. This helps to avoid overfitting and achieve better performance on the validation set.<\/li>\n<\/ol>\n\n\n\n<p class=\"has-text-align-justify\">In summary, the training of GPT-3 was scaled up using a combination of distributed training, model parallelism, pipeline parallelism, mixed precision training and adaptive optimization. These techniques allowed the model to handle its massive size and complexity.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"aioseo-distributed-training-but-how\">Distributed training, but how?<\/h3>\n\n\n\n<p class=\"has-text-align-justify\">In order to train a large AI model, scaling across multiple GPUs, TPUs, and machines is necessary. However, achieving this becomes more complex when using a compute cluster, as distributing tasks and aggregating results requires careful consideration of several points. Specifically, when training a large model at scale using a cluster of machines, the following factors must be taken into account [<a href=\"#ref8\" title=\"8\">8<\/a>][<a href=\"#ref9\" title=\"9\">9<\/a>]:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Communication overhead: <\/strong>Distributed training involves exchanging gradients and model updates between different machines, which can introduce significant communication overhead. Optimizing communication patterns and reducing communication frequency can help reduce this overhead and speed up training.<\/li>\n\n\n\n<li><strong>Load balancing: <\/strong>Distributing the workload across multiple machines requires careful load balancing to ensure that each machine has a similar workload. 
Imbalanced workloads can lead to underutilization of some machines and slower training overall.<\/li>\n\n\n\n<li><strong>Fault tolerance:<\/strong> When using clusters of machines, it is important to consider fault tolerance, as failures in one or more machines can interrupt training. Strategies for fault tolerance include checkpointing, replication of model parameters, and the use of redundant compute nodes.<\/li>\n\n\n\n<li><strong>Network topology: <\/strong>The topology of the network connecting the machines can affect the performance of distributed training. For example, using a network with high bandwidth and low latency can reduce communication overhead and speed up training.<\/li>\n\n\n\n<li><strong>Scalability: <\/strong>The ability to scale up the number of machines used for training is important to accommodate larger models and datasets. Ensuring that the training process is scalable requires careful consideration of the communication patterns and load balancing across a large number of machines.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"aioseo-the-trend-continues\">The trend continues<\/h3>\n\n\n\n<p class=\"has-text-align-justify\">Taking a look at an even larger language model, the Megatron-Turing NLG [<a href=\"#ref10\" title=\"10\">10<\/a>], we can see that the trend continues. Such large models must therefore be trained on large-scale infrastructure, with special software and hardware designs optimized for system throughput on large datasets. In [<a href=\"#ref10\" title=\"10\">10<\/a>] the tradeoffs of some of these techniques are discussed. 
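To make the data-parallel idea from the list above concrete, here is a deliberately tiny, single-process sketch. It is an illustration only: the linear model, learning rate and data are invented for the example, and the gradient averaging stands in for the all-reduce step that real frameworks perform across machines.

```python
# Toy simulation of data-parallel training: each "worker" holds one shard of
# the data and computes a local gradient; the gradients are then averaged
# (the role of all-reduce) and every worker applies the identical update.

def local_gradient(w: float, shard: list) -> float:
    """Gradient of the mean squared error of y = w * x over one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def train(shards: list, lr: float = 0.02, steps: int = 100) -> float:
    w = 0.0                                              # all workers start identically
    for _ in range(steps):
        grads = [local_gradient(w, s) for s in shards]   # in parallel in reality
        avg_grad = sum(grads) / len(grads)               # the "all-reduce" step
        w -= lr * avg_grad                               # same update on every worker
    return w

data = [(x, 3.0 * x) for x in range(1, 9)]  # generated from y = 3x
shards = [data[i::4] for i in range(4)]     # partitioned across four workers
print(round(train(shards), 3))              # converges toward w = 3.0
```

In a real system the list comprehension would run on separate devices and the averaging would be a collective operation (for example an all-reduce over NCCL or MPI); the point here is only that every worker ends each step with identical parameters.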
Only the combination of several techniques and the use of a supercomputer powered by 560 DGX A100 servers, each with eight NVIDIA A100 80GB Tensor Core GPUs, allowed NVIDIA to scale up to thousands of GPUs and train the model in an acceptable time.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"aioseo-conclusion-and-outlook\">Conclusion and outlook<\/h2>\n\n\n\n<p class=\"has-text-align-justify\">The trend for more compute is expected to continue as AI applications become more complex and require larger datasets and more sophisticated algorithms. To keep up with this demand, we can expect continued improvements in specialized hardware and software optimization techniques, such as neural architecture search, pruning and quantization. Scaling aspects of both software and hardware are critical to meet the increasing demand for computing power and to make AI more efficient and accessible to a wider range of applications.<br>In contrast to chasing ever larger models, another approach would be to focus more on specific tasks than on general-purpose models. As language models continue to grow in size, researchers are beginning to see diminishing returns in terms of their performance improvements. While larger language models have shown impressive capabilities in natural language processing tasks, the computational and financial resources required to train and run them have also increased exponentially. 
Therefore, [<a href=\"#ref4\" title=\"4\">4<\/a>] proposes a more practical approach, which might be more cost-effective and environmentally friendly than the next big general-purpose AI.<br>For now, we can continue to observe large tech companies joining forces for the next big AI model, using supercomputers, cloud infrastructure and all the compute they have to build even more impressive AI, and thus developing ever more sophisticated software and hardware architectures to handle the massive amounts of data and computation required to train such models.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"aioseo-sources\">Sources<\/h2>\n\n\n\n<p id=\"ref1\">[1] Our World in Data. \u201cComputation used to train notable artificial intelligence systems\u201d. In: (2023). URL: <a href=\"https:\/\/ourworldindata.org\/grapher\/artificial-intelligence-training-computation\" target=\"_blank\" rel=\"noopener nofollow\" title=\"\">https:\/\/ourworldindata.org\/grapher\/artificial-intelligence-training-computation<\/a>.<\/p>\n\n\n\n<p id=\"ref2\">[2] OpenAI. \u201cAI and Compute\u201d. In: (2018). URL: <a href=\"https:\/\/openai.com\/research\/ai-and-compute\" target=\"_blank\" rel=\"noopener nofollow\" title=\"\">https:\/\/openai.com\/research\/ai-and-compute<\/a>.<\/p>\n\n\n\n<p id=\"ref3\">[3] Jaime Sevilla et al. Compute Trends Across Three Eras of Machine Learning. 2022. arXiv: <a href=\"https:\/\/arxiv.org\/abs\/2202.05924\" target=\"_blank\" rel=\"noopener nofollow\" title=\"\">2202.05924<\/a> [cs.LG].<\/p>\n\n\n\n<p id=\"ref4\">[4] Julien Simon. \u201cLarge Language Models\u201d. In: (2021). URL: <a href=\"https:\/\/huggingface.co\/blog\/large-language-models\" target=\"_blank\" rel=\"noopener nofollow\" title=\"\">https:\/\/huggingface.co\/blog\/large-language-models<\/a>.<\/p>\n\n\n\n<p id=\"ref5\">[5] Wikipedia. Moore\u2019s Law. 2023. 
URL: <a href=\"https:\/\/en.wikipedia.org\/wiki\/Moore%27s_law\" target=\"_blank\" rel=\"noopener nofollow\" title=\"\">https:\/\/en.wikipedia.org\/wiki\/Moore%5C%27s_law<\/a>.<\/p>\n\n\n\n<p id=\"ref6\">[6] Annika Strau\u00df and Maximilian Kaiser. \u201cAn overview of Large Scale Deep Learning\u201d. In: (2021). URL: <a href=\"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2022\/03\/31\/an-overview-of-large-scale-deep-learning\/\" target=\"_blank\" rel=\"noopener nofollow\" title=\"\">https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2022\/03\/31\/an-overview-oflarge-scale-deep-learning\/<\/a>.<\/p>\n\n\n\n<p id=\"ref7\">[7] Tom B. Brown et al. \u201cLanguage Models are Few-Shot Learners\u201d. In: CoRR abs\/2005.14165 (2020).<br>arXiv: 2005.14165. URL: <a href=\"https:\/\/arxiv.org\/abs\/2005.14165\" target=\"_blank\" rel=\"noopener nofollow\" title=\"\">https:\/\/arxiv.org\/abs\/2005.14165<\/a>.<\/p>\n\n\n\n<p id=\"ref8\">[8] Mart\u00edn Abadi et al. TensorFlow: A System for Large-Scale Machine Learning. 2016. URL: <a href=\"https:\/\/www.usenix.org\/system\/files\/conference\/osdi16\/osdi16-abadi.pdf\" target=\"_blank\" rel=\"noopener nofollow\" title=\"\">https:\/\/www.usenix.org\/system\/files\/conference\/osdi16\/osdi16-abadi.pdf<\/a>.<\/p>\n\n\n\n<p id=\"ref9\">[9] Henggang Cui, Gregory R. Ganger, and Phillip B. Gibbons. Scalable deep learning on distributed GPUs with a GPU-specialized parameter server. 2015. URL:<a href=\"https:\/\/www.pdl.cmu.edu\/PDL-FTP\/BigLearning\/CMU-PDL-15-107.pdf\" target=\"_blank\" rel=\"noopener nofollow\" title=\"\"> https:\/\/www.pdl.cmu.edu\/PDLFTP\/BigLearning\/CMU-PDL-15-107.pdf<\/a>.<\/p>\n\n\n\n<p id=\"ref10\">[10] Paresh Kharya and Ali Alvi. \u201cUsing DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World\u2019s Largest and Most Powerful Generative Language Model\u201d. In: (2021). 
URL: <a href=\"https:\/\/developer.nvidia.com\/blog\/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model\/\" target=\"_blank\" rel=\"noopener nofollow\" title=\"\">https:\/\/developer.nvidia.com\/blog\/using-deepspeed-and-megatron-to-trainmegatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generativelanguage-model\/<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>AI and Scaling the Compute becomes more relevant as the strive for larger language models and general purpose AI continues. The future of the trend is unknown as the rate of doubling the compute outpaces Moore&#8217;s Law rate of every two year to a 3.4 month doubling. Introduction AI models have been rapidly growing in [&hellip;]<\/p>\n","protected":false},"author":1119,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[652,650,223],"tags":[355,642,641,217],"ppma_author":[898],"class_list":["post-24427","post","type-post","status-publish","format-standard","hentry","category-artificial-intelligence","category-scalable-systems","category-ultra-large-scale-systems","tag-ai","tag-gpt","tag-ml","tag-uls"],"aioseo_notices":[],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":27789,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2025\/07\/24\/beyond-reactive-how-ai-is-revolutionizing-kubernetes-autoscaling\/","url_meta":{"origin":24427,"position":0},"title":"Beyond Reactive: How AI is Revolutionizing Kubernetes Autoscaling","author":"Hannah Holzheu","date":"24. July 2025","format":false,"excerpt":"Note:\u00a0This blog post was written for the module Enterprise IT (113601a) in the summer semester of 2025 Introduction Kubernetes has become the leading open-source platform for managing containerized applications. 
Its ability to automate deployment, scaling, and operations helps teams efficiently manage microservices architectures and dynamic cloud workloads. A cornerstone of\u2026","rel":"","context":"In &quot;Allgemein&quot;","block_context":{"text":"Allgemein","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/allgemein\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2025\/07\/AIvsRuleBased.png?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2025\/07\/AIvsRuleBased.png?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2025\/07\/AIvsRuleBased.png?resize=525%2C300&ssl=1 1.5x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2025\/07\/AIvsRuleBased.png?resize=700%2C400&ssl=1 2x"},"classes":[]},{"id":23138,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2022\/03\/31\/an-overview-of-large-scale-deep-learning\/","url_meta":{"origin":24427,"position":1},"title":"An overview of Large Scale Deep Learning","author":"mk374","date":"31. 
March 2022","format":false,"excerpt":"article by Annika Strau\u00df (as426) and Maximilian Kaiser (mk374) Introduction Improving Deep Learning with ULS for superior model training Single Instance Single Device (SISD) Multi Instance Single Device (MISD) Multi Instance Multi Device (MIMD) Single Instance Multi Device (SIMD) Model parallelism Data parallelism Improving ULS and its components with the\u2026","rel":"","context":"In &quot;Allgemein&quot;","block_context":{"text":"Allgemein","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/allgemein\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/quantum-physics-g1357f44f5_1920-Kopie.jpg?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/quantum-physics-g1357f44f5_1920-Kopie.jpg?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/quantum-physics-g1357f44f5_1920-Kopie.jpg?resize=525%2C300&ssl=1 1.5x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/quantum-physics-g1357f44f5_1920-Kopie.jpg?resize=700%2C400&ssl=1 2x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/quantum-physics-g1357f44f5_1920-Kopie.jpg?resize=1050%2C600&ssl=1 3x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/quantum-physics-g1357f44f5_1920-Kopie.jpg?resize=1400%2C800&ssl=1 4x"},"classes":[]},{"id":27565,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2025\/02\/28\/scaling-an-ai-transcription-model-as-a-service\/","url_meta":{"origin":24427,"position":2},"title":"Scaling an AI Transcription Model as a Service","author":"ns144","date":"28. February 2025","format":false,"excerpt":"Ton-Texter is a Software as a service solution that delivers state of the art transcription performance. 
In this blog post, we will explore how we have improved its scalability to handle high demand.","rel":"","context":"In &quot;Allgemein&quot;","block_context":{"text":"Allgemein","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/allgemein\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2025\/02\/Featured_Image01.jpg?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2025\/02\/Featured_Image01.jpg?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2025\/02\/Featured_Image01.jpg?resize=525%2C300&ssl=1 1.5x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2025\/02\/Featured_Image01.jpg?resize=700%2C400&ssl=1 2x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2025\/02\/Featured_Image01.jpg?resize=1050%2C600&ssl=1 3x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2025\/02\/Featured_Image01.jpg?resize=1400%2C800&ssl=1 4x"},"classes":[]},{"id":24246,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2023\/02\/27\/how-edge-computing-is-moving-the-cloud-closer-to-the-user\/","url_meta":{"origin":24427,"position":3},"title":"How Edge Computing is moving the Cloud closer to the User","author":"Nikolai Thees","date":"27. February 2023","format":false,"excerpt":"Did you know clouds have sharp edges? What is Edge Computing? Let\u2019s say you want to deploy a web application. 
In order to serve your app to your users, you need a server on which you can run your application.Traditionally, you had the option to either buy and run the\u2026","rel":"","context":"In &quot;Cloud Technologies&quot;","block_context":{"text":"Cloud Technologies","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/scalable-systems\/cloud-technologies\/"},"img":{"alt_text":"","src":"https:\/\/lh4.googleusercontent.com\/uCYgoQ2o7ueAQYKEvAup43hsF7rDPIyBl5Uh-qMTzmOU5mozruJsWI_kp_BTfpjhMkcrhbEzHoZvBthhNk9GrF9KE3Oxd73nnOJ2YZsIZt66xSEJghrtdVd00YeozgM6k-ACpmcCHexjQ8VLo6EC-QM","width":350,"height":200},"classes":[]},{"id":5635,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2019\/03\/05\/a-dive-into-serverless-on-the-basis-of-aws-lambda\/","url_meta":{"origin":24427,"position":4},"title":"A Dive into Serverless on the Basis of AWS Lambda","author":"Can Kattwinkel","date":"5. March 2019","format":false,"excerpt":"Hypes help to overlook the fact that tech is often reinventing the wheel, forcing developers to update applications and architecture accordingly in painful migrations. Besides Kubernetes one of those current hypes is Serverless computing. While everyone agrees that Serverless offers some advantages it also introduces many problems. 
The current trend\u2026","rel":"","context":"In &quot;Allgemein&quot;","block_context":{"text":"Allgemein","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/allgemein\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2019\/03\/warm.png?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2019\/03\/warm.png?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2019\/03\/warm.png?resize=525%2C300&ssl=1 1.5x"},"classes":[]},{"id":23961,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2023\/02\/10\/microservices-any-good\/","url_meta":{"origin":24427,"position":5},"title":"Microservices &#8211; any good?","author":"Kim Bastiaanse","date":"10. February 2023","format":false,"excerpt":"As software solutions continue to evolve and grow in size and complexity, the effort required to manage, maintain and update them increases. 
To address this issue, a modular and manageable approach to software development is required.\u00a0Microservices architecture provides a solution by breaking down applications into smaller, independent services that can\u2026","rel":"","context":"In &quot;Allgemein&quot;","block_context":{"text":"Allgemein","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/allgemein\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/02\/Microservice.png?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/02\/Microservice.png?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/02\/Microservice.png?resize=525%2C300&ssl=1 1.5x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/02\/Microservice.png?resize=700%2C400&ssl=1 2x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/02\/Microservice.png?resize=1050%2C600&ssl=1 3x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/02\/Microservice.png?resize=1400%2C800&ssl=1 4x"},"classes":[]}],"jetpack_sharing_enabled":true,"authors":[{"term_id":898,"user_id":1119,"is_guest":0,"slug":"mb412","display_name":"Marvin 
Blessing","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/7022b948d92d48d33b1e98eec6739244?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts\/24427","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/users\/1119"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/comments?post=24427"}],"version-history":[{"count":12,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts\/24427\/revisions"}],"predecessor-version":[{"id":25324,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts\/24427\/revisions\/25324"}],"wp:attachment":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/media?parent=24427"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/categories?post=24427"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/tags?post=24427"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/ppma_author?post=24427"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}