Guide to Process Management – CPUs and GPUs (NVIDIA)

Say you have learned all the Machine Learning and Deep Learning theory and are trying to apply it in practice; you have just been given access to your first computer with GPUs capable of training Neural Networks. GPU multiprocessing can be incomprehensible if you jump straight into coding after skimming basic multiprocessing concepts, tutorials, and examples. In this tutorial, I will cover the knowledge and concepts you need before you start tweaking your fancy GPUs. I assume that you already have some knowledge of Deep Learning and are comfortable with object-oriented programming in Python.

The drastic increase in Deep Learning adoption over the past few years has also increased the demand for GPU-accelerated coding on platforms like PyTorch and TensorFlow. Even though plenty of tutorials discuss the structure and modules of these libraries, not many of them discuss GPU processing, because the use case for each GPU depends heavily on the hardware itself. In this tutorial, I will document my journey to understanding the physical and programmatic aspects of GPU multiprocessing using Python, with useful references.

Contents

  1. Contents
  2. CPU vs. GPU
    1. General-Purpose Graphics Processing Unit (GPGPU)
    2. Prerequisites for programming CPUs and GPUs
  3. Python Multi-Threading and Multi-Processing
    1. What are Threads and Processes?
    2. Python’s Global Interpreter Lock
    3. Python's Multithreading (CPU and GPU)
    4. Python's Multiprocessing (CPU and GPU)
    5. CPU Multiprocessing
      1. Process – Anatomy and Scope
      2. freeze_support
      3. Multiprocessing
      4. Parent-Child relationship
        1. Daemon vs. non-daemon process
      5. Memory Inheritance
      6. Parallel and Serial processing
        1. Pool (Parallel Processing):
          1. apply_async()
          2. map()
      7. Process Memory Pipelining (Serial Processing)
        1. Pipe:
        2. Queue:
  4. Internal Hardware of CPU and GPU
  5. PyTorch Multi-Processing

CPU vs. GPU

I think most of you already know the fundamental differences between the cores of CPUs and GPUs. If not, I will briefly record the basics of CPUs and GPUs in this section. In traditional computers, the CPU was purposed to do all the computing operations, and the GPU was introduced to handle graphics rendering. Thanks to the demand for video games with high-end graphics, GPUs evolved to perform large-scale repetitive operations efficiently.

Coincidentally, GPUs turned out to be exactly the solution mathematical researchers were looking for to train Deep Neural Networks, since training requires a lot of repetitive learning computations.

General-Purpose Graphics Processing Unit (GPGPU)

The Wikipedia definition of GPGPU says:

A general-purpose GPU (GPGPU) is a graphics processing unit (GPU) that performs non-specialized calculations that would typically be conducted by the CPU (central processing unit). Ordinarily, the GPU is dedicated to graphics rendering.

General-purpose computing on graphics processing units

At the time of writing this post, a modern CPU contains around 16 cores, whereas a good high-end GPU has 10,496 cores. I know the first thing that comes to your mind:

Great, now we have more cores; let’s replace the CPUs with GPUs!

Well, it doesn't work like that.

Even though a CPU has fewer cores, each core can efficiently perform complex operations independently. In a GPU, groups of cores must perform the same operation in parallel on multiple data units. This is why CPU cores achieve better time efficiency than GPU cores on smaller operations. As mentioned in "Fast Python: High-Performance Techniques for Large Datasets," a CPU can be compared to a Ferrari and a GPU to a bus.

Prerequisites for programming CPUs and GPUs

As discussed in the previous section, the computational differences between CPUs and GPUs result from differences in the hardware and software used for process pipelining. As theorists, we tend to care less about the hardware aspects of computers. Still, it is essential to familiarize yourself with some hardware aspects of your machine for efficient programming. I recommend the following order of concepts for better understanding.

  • Python’s Multiprocessing and Multithreading
  • Internal Hardware of CPU and GPU
  • PyTorch Multiprocessing

Python Multi-Threading and Multi-Processing

Before we begin, I highly recommend you read Chapters 1-4 of "Fast Python: High-Performance Techniques for Large Datasets." The concepts I will discuss in the following sections are clearly explained in that book.

What are Threads and Processes?

To put it in simple words, a process is a copy of a Python instance that runs on a computer, and a thread is an instance of a function that the process can initiate; both processes and threads can run serially or in parallel on the computer (for the sake of simplicity, let's consider the case of a CPU-only computer).

  • When you run your code on the CPU, a main process is initiated, which operates on a single main thread that can use one CPU core at a time.
  • In multithreading, the main thread is used to create multiple child threads (child threads can also act as parent threads and create more threads), which can run on different cores of the CPU in various combinations of parallel and serial execution to improve execution time.
  • Multiprocessing follows a similar mechanism, creating child processes from the main process or other parent processes. To make a new process, the computer has to create a copy of the entire instance of the code, which consumes a lot of memory and takes time to start. Hence, creating threads is faster than creating processes; a minimal sketch contrasting the two follows this list.
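
To make the distinction concrete, here is a minimal sketch (my own illustration, using the standard threading and multiprocessing modules and a hypothetical work function) that launches one thread and one process from the main thread:

import threading
from multiprocessing import Process

def work(name):    # hypothetical task
    print(f'{name} running')

if __name__ == '__main__':
    # a thread lives inside the current process and shares its memory
    t = threading.Thread(target=work, args=('thread',))
    # a process is a separate Python instance with its own memory
    p = Process(target=work, args=('process',))
    t.start(); p.start()
    t.join(); p.join()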

Python’s Global Interpreter Lock

Unfortunately, Python has a Global Interpreter Lock (GIL), which allows only one thread to execute at a time. Even though you cannot achieve true multithreading in pure Python, you can use multiprocessing for parallel computing, which is computationally more expensive than multithreading.

You may think: This makes Python a poor fit for parallel computations, so why are people using it for Big Data?

Well, to say that multithreading is possible in Python is both true and not true. Even though you cannot parallelize threads in pure Python, Python allows you to interface your code with libraries written in lower-level languages like C, where multithreading can be implemented for specific operations; this feature, along with the vast community and resources, is what makes Python really powerful for Big Data applications. The sketch below demonstrates the GIL's effect.
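
As a quick sanity check, here is a minimal sketch (my own illustration, not from the book) showing that two threads running a pure-Python, CPU-bound function are no faster than running it twice sequentially:

import threading
import time

def count_down(n):    # pure-Python, CPU-bound work
    while n > 0:
        n -= 1

N = 50_000_000

if __name__ == '__main__':
    start = time.perf_counter()
    count_down(N); count_down(N)
    print(f'sequential: {time.perf_counter() - start:.2f}s')

    start = time.perf_counter()
    t1 = threading.Thread(target=count_down, args=(N,))
    t2 = threading.Thread(target=count_down, args=(N,))
    t1.start(); t2.start()
    t1.join(); t2.join()
    # the threaded run is no faster, because the GIL lets only one
    # thread execute Python bytecode at any moment
    print(f'threaded:   {time.perf_counter() - start:.2f}s')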

Python's Multithreading (CPU and GPU)

The general implementation of multithreading is not very popular in Python because of the GIL, and even though multiprocessing is a bit slower than multithreading, Python's multiprocessing is still fast enough to handle many complicated tasks. Hence, I will only discuss the GPU implementation of multithreading in this section.

The CUDA library is used for implementing multithreading on GPUs. Deep Learning libraries like PyTorch and TensorFlow are frameworks built on top of CUDA. Even though these libraries handle a lot of GPU processing, they do not address everything; you will still need to write code specific to your hardware and model. I also recommend Numba, a Python library, for implementing GPU acceleration; a small kernel sketch follows.
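
As a taste of what Numba offers, here is a minimal sketch of a vector-addition kernel (my own illustration, assuming a CUDA-capable GPU and the numba and numpy packages are installed):

from numba import cuda
import numpy as np

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)          # global thread index across all blocks
    if i < out.size:          # guard against out-of-range threads
        out[i] = a[i] + b[i]

if __name__ == '__main__':
    n = 1_000_000
    a = np.ones(n, dtype=np.float32)
    b = np.ones(n, dtype=np.float32)
    out = np.zeros(n, dtype=np.float32)
    threads_per_block = 256
    blocks = (n + threads_per_block - 1) // threads_per_block
    vector_add[blocks, threads_per_block](a, b, out)   # kernel launch
    print(out[:5])            # [2. 2. 2. 2. 2.]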

The implementation of GPU multithreading using CUDA is out of scope for this article, since we will not use thread-level coding to build Deep Learning models. Plenty of resources are available online if you are interested in learning CUDA.

Python's Multiprocessing (CPU and GPU)

Multiprocessing is the basic building block for all Machine Learning models. Multiprocessing on CPUs differs significantly from GPUs; I will discuss the differences in the next section. In this section, I will explain the general operation of CPU multiprocessing.

Every Python program is an individual process containing a main thread; a Python process is an interpreter instance that executes the program as bytecode, and each process's main thread can create and terminate child processes. When we are not using customized multiprocessing, the Python interpreter takes care of default process management, which generally performs serial initiation and termination of child processes; in this case, not all cores of the CPU may be used.

CPU Multiprocessing

Python's multiprocessing module provides the capability to control parallel processing and enables communication between processes; the module ships with the standard library, so there is no need to install it.

I highly recommend the Python Multiprocessing Jump-Start book by Jason Brownlee to get started with multiprocessing; this book is a must-read if you have never worked with multiprocessing before. It covers all the basic concepts of the multiprocessing module and shows, with examples, how its methods address different use cases. To learn implementation techniques, I recommend reading the open-source code of PyTorch alongside its applications.

Since this post is only meant to teach the knowledge required to get started with process management, I will not cover all the deep concepts of multiprocessing. You can take this tutorial as an introduction and work your way up to advanced process management.

Process – Anatomy and Scope

Python lets us initiate a custom process using the Process class of multiprocessing. To create a Process instance, you must provide the required argument target {target method} and the optional arguments args {tuple of input arguments for the target method} and daemon {explained later}. Every process can be started using the start() method, and the workflow can be made to wait for the process to finish using the join() method. The Process object should always be initiated in the main() thread.

from multiprocessing import Process

def test1(itr):
    print(f'Print process-{itr} called')

if __name__ == '__main__':
    process = Process(target=test1, args=(1,))  # args must be a tuple
    process.start()   # launch the child process
    process.join()    # wait for it to finish

A process can be killed using the terminate() or kill() methods; both have the same basic functionality, with minor differences described in the documentation. A short sketch follows.
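
For illustration, here is a minimal sketch (my own, with a hypothetical long_task function) of force-terminating a child process:

from multiprocessing import Process
from time import sleep

def long_task():              # hypothetical long-running task
    sleep(60)

if __name__ == '__main__':
    p = Process(target=long_task)
    p.start()
    sleep(1)
    p.terminate()             # on Unix this sends SIGTERM; kill() sends SIGKILL
    p.join()
    print(p.exitcode)         # a negative exit code means killed by a signal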

freeze_support

Freezing is the operation of packaging a Python instance with all its required libraries for distribution, where the package is converted into machine-executable code. This helps the end user run the Python code without having to install all the required packages. Different operating systems follow their own packaging methods; for example, Windows uses .exe and macOS uses .app.

The multiprocessing module provides the freeze_support() function, which takes care of the freezing operation of our Python code on any operating system. Hence, it is highly recommended to always call this function at the beginning of the main thread to avoid RuntimeErrors.

import multiprocessing as mp

if __name__ == '__main__':
    mp.freeze_support()

Multiprocessing

Multiple child processes can be initialized from the parent process's main thread using various multiprocessing classes. Every child process instance goes through the same life cycle described in the previous section.

from multiprocessing import Process

def test1(itr):
    print(f'Print process-{itr} called')

if __name__ == '__main__':
    process_list = []
    for i in range(10):
        process_list.append(Process(target=test1, args=(i,)))
    [process.start() for process in process_list]  # launch all children
    [process.join() for process in process_list]   # wait for all to finish

Child processes can be initiated with the Process class using a basic iterative loop, as shown above. In this case, all the child processes are started serially and added to a list, and the parent waits for each of them to finish in the subsequent list operation, which allows the parent process to terminate only after all the child processes have executed.

This code only performs initiation and termination of child processes using the Process class. The Process class can address ad-hoc tasks, but programming automated multiprocessing with it alone would be tedious. Hence, the multiprocessing module offers a range of classes that can perform operations like parallel initiation of processes, queued initiation of processes, and coordination and communication links between processes, which we will learn about in the following sections.

Parent-Child relationship

I hope everyone is familiar with the definitions of parent process and child process by now; the parent process is the process in whose main thread other child processes are created, and a child process is a process created from another process's main thread.

There is also another kind of process called an orphan process, a special case created when the parent is killed while its child processes are still running. To understand this phenomenon, you need to understand the classification variable of the Process class called daemon.

Daemon vs. non-daemon process

The daemon and non-daemon processes play a crucial role in defining the relationship between parent and child processes. The daemon variable takes the values True, False, and None (default).

import multiprocessing as mp
from multiprocessing import Process
from time import sleep

def background_task():    # hypothetical long-running task
    sleep(10)

if __name__ == '__main__':
    mp.freeze_support()
    new_process = Process(target=background_task, daemon=True)
    new_process.start()
    # the parent exits immediately; the daemon child is terminated with it

By setting the daemon variable to True, a child process is classified as a daemon process; if it is False, the process is classified as a non-daemon process; and if it is set to None, the daemon variable inherits its value from the parent process.

A parent process only waits for its non-daemon processes to terminate before its own termination. If a daemon process is still running after the parent process terminates, the daemon process loses its parent and is re-parented to the init process; such daemon processes are classified as orphaned processes.

A daemon process should not have child processes; if it does, the daemon process will terminate without waiting for its children, leading to unexpected orphaned processes and RuntimeErrors.

Memory Inheritance

The multiprocessing module offers set_start_method() to determine the initiation mode for new processes on the parent's main thread; the two main modes of process initiation are fork and spawn. set_start_method() should be called before creating any Process instances.

import multiprocessing as mp

if __name__ == '__main__':
    mp.freeze_support()
    mp.set_start_method('spawn')

The fork mode lets all of the parent process's memory be inherited by the child processes, but does not copy the parent process's main thread; this mode is a faster way to create new processes since they require less memory setup. It is also unsafe, because the shared state can be manipulated by threads of different processes simultaneously, which may lead to deadlocks or memory-corruption errors.

In spawn mode, each child process is initiated from scratch with a fresh copy of the interpreter state, and each child process has its own individual memory. This mode is safe, but also slower because of the memory initiation step for every child process. A minimal demo of the difference is sketched below.
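
Here is a minimal sketch (my own illustration; note that fork is unavailable on Windows) showing how the start method changes what a child sees:

import multiprocessing as mp
from multiprocessing import Process

counter = 0                   # module-level state

def show_counter():
    # under 'fork' the child inherits the parent's memory and prints 5;
    # under 'spawn' the module is re-imported from scratch and it prints 0
    print(f'child sees counter = {counter}')

if __name__ == '__main__':
    mp.set_start_method('fork')   # swap to 'spawn' to compare
    counter = 5
    p = Process(target=show_counter)
    p.start()
    p.join()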

Parallel and Serial processing

Python's multiprocessing library gives us the ability to manage multiple processes using two techniques: parallel and serial processing. Parallel processing can be executed using the Pool class, and serial processing can be executed using the Queue class. These classes offer a wide variety of methods that can implement various multiprocessing operations, which I will explain in the following sections.

Pool (Parallel Processing):

A Pool instance takes the number of worker/child processes to be created as input and initiates a pool of that many worker processes; the Pool instance can be shut down using the close() and terminate() methods, and the worker processes can be waited on using the join() method. These worker processes can execute any number of input tasks within the scope of the Pool; all workers run simultaneously to share and execute the input tasks.

from multiprocessing import Pool
...
...

if __name__ == '__main__':
    ...
    pool = Pool(10)
    ... #perform tasks
    pool.close()
    pool.join()

The Pool instance can also be created using the with statement, which safely executes tasks and terminates the pool automatically.

from multiprocessing import Pool
...
...

if __name__ == '__main__':
    ...
    with Pool(10) as pool:
        ... #perform tasks
# pool is closed automatically

The Pool instance offers the following methods:

apply_async()

By default, the Pool instance executes all tasks synchronously, i.e., all tasks return and terminate in the order they were fed to the Pool; apply_async() instead submits a task and returns immediately with a handle to the future result. Since this method is not widely used in GPU multiprocessing, I am skipping the detailed explanation, but you're welcome to research it on your own; a minimal sketch follows.
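
For completeness, a minimal sketch (my own, with a hypothetical square task) of submitting work asynchronously:

from multiprocessing import Pool

def square(x):                # hypothetical task
    return x * x

if __name__ == '__main__':
    with Pool(4) as pool:
        # apply_async returns immediately with an AsyncResult;
        # the task runs in a worker while the parent keeps going
        async_result = pool.apply_async(square, (3,))
        print(async_result.get())   # blocks until the result is ready -> 9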

map()

Pool mapping is commonly used to apply a single task to an iterable argument list.

from multiprocessing import Pool

def task(x):                  # hypothetical task
    return x * x

if __name__ == '__main__':
    with Pool(4) as pool:     # e.g., 4 worker processes
        for result in pool.map(task, range(10)):
            print(result)     # work with each result

# pool is closed automatically

Even though the Pool instance is pretty handy for executing a single task on multiple data points, I personally find it a brute-force way of echoing a task across multiple samples.

Process Memory Pipelining (Serial Processing)

Even though a single parent process hosts multiple child processes, the child processes do not share common memory with the parent process. In multiprocessing, every process has its own memory space; the child processes only inherit copies of the parent process's memory.

If any important changes are made in a child process's memory, they will not be updated in the base memory of the parent process. These changes have to be communicated to the parent process using the memory pipelining classes Pipe or Queue.

Memory conflicts may occur when multiple child processes try to access the parent memory simultaneously. To resolve such conflicts, Python's multiprocessing provides the synchronization classes Event, Condition, Barrier, Lock, and Semaphore; a small Lock sketch follows.
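
As a minimal illustration of one of these primitives (my own sketch, using Lock together with a shared Value), here is how a lock serializes access to shared state:

from multiprocessing import Process, Lock, Value

def increment(lock, counter):
    for _ in range(1000):
        with lock:                    # only one process may hold the lock at a time
            counter.value += 1        # safe read-modify-write on shared memory

if __name__ == '__main__':
    lock = Lock()
    counter = Value('i', 0)           # a shared integer
    workers = [Process(target=increment, args=(lock, counter)) for _ in range(4)]
    [w.start() for w in workers]
    [w.join() for w in workers]
    print(counter.value)              # always 4000 with the lock in place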

Pipe:

The Pipe class returns two Connection instances. These two connection instances can be passed to two different processes, where they act as the two ends of a data communication link between those processes.

from multiprocessing import Process, Pipe
from time import sleep

def get_data(conn2):
    sleep_time = conn2.recv()      # blocks until the parent sends data
    sleep(sleep_time)
    print(f'Slept for time: {sleep_time}')
    conn2.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    child_process = Process(target=get_data, args=(child_conn,))
    child_process.start()
    parent_conn.send(2)            # send the sleep time to the child
    child_process.join()

The above example shows how data is transferred from the parent process to the child process. The connection instance has three methods: send(), recv(), and close().

Let's say the connection_1 and connection_2 instances have been passed to two processes (process_1, process_2). The connection_1.send() method takes one argument, the data to be passed from process_1 to process_2. The connection_2.recv() method can then be used in process_2 to load the data passed down from process_1. At any time, the close() method can be used to close a connection.

The major downside of the Pipe class is that the connection instances are process-specific; once they are shared between two processes, the pipe can only be used between those two processes.

Queue:

The multiprocessing Queue follows the same generalized "first in, first out" definition for data transfer between processes. Unlike Pipe, a Queue instance can be shared among multiple processes, and it uses serial processing to perform data transfer tasks between them.

The real power of Queue shows in server-client communication, where a base server process hosting data/code communicates with multiple client processes. Since the base server process communicates asynchronously with multiple clients, Queues can be used to avoid data corruption issues.

The Queue class has three main methods: put(), get(), and close(). The put() method passes data into the queue instance, get() retrieves data from the queue, and close() closes the queue instance. A short producer-consumer sketch follows.
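
A minimal producer-consumer sketch (my own, with a hypothetical worker function) of handing results back to the parent:

from multiprocessing import Process, Queue

def worker(queue, idx):           # hypothetical task
    queue.put(f'result from worker {idx}')   # hand data back to the parent

if __name__ == '__main__':
    queue = Queue()
    workers = [Process(target=worker, args=(queue, i)) for i in range(3)]
    [w.start() for w in workers]
    for _ in range(3):
        print(queue.get())        # blocks until an item is available
    [w.join() for w in workers]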

Even though these methods have simple notation, implementing queues with multiple processes is a complex task. In a situation where multiple processes simultaneously access and edit the same memory base using pipe instances, data corruption can occur because of the lack of synchronization of concurrent modifications made by different processes. Queues address this issue to some extent by implementing serial communication and allowing access to only one process at a time. But data corruption can still occur if the queue communication channels are not policed properly.

In multiprocessing, no two processes can inherently be aware of each other's processing state. This awareness plays a key role in designing communication pipelines; for example, a target process may be ready to receive data while the source process is not yet ready to send it, so the communication channel must account for such mismatches.

Internal Hardware of CPU and GPU

PyTorch Multi-Processing
