Concurrency

From CS 61A Wiki

Concurrency allows multiple operations to be in progress during overlapping time periods; parallel computing performs them truly simultaneously on multiple processors. On a single processor, the CPU makes it seem that programs are running concurrently by rapidly switching between them, doing some work for each process before switching to the next.

Increasing a single processor's clock speed runs into physical heat and energy limitations, so CPU manufacturers have turned to putting multiple processors on one chip (multicore) to achieve more processing power through parallelism.

Parallelism in Python

Threading

In threading, multiple "threads" of execution exist within a single interpreter. Each thread executes code independently from the others, though they share the same data.

>>> import threading
>>> def thread_hello():
        other = threading.Thread(target=thread_say_hello, args=())
        other.start()
        thread_say_hello()
 
>>> def thread_say_hello():
        print('hello from', threading.current_thread().name)
 
>>> thread_hello()
hello from Thread-1
hello from MainThread

Multiprocessing

Multiprocessing allows a program to spawn multiple interpreters, or processes, each of which runs independently and shares no data by default. Any data that must be shared is communicated explicitly between processes.

>>> import multiprocessing
>>> def process_hello():
        other = multiprocessing.Process(target=process_say_hello, args=())
        other.start()
        process_say_hello()
 
>>> def process_say_hello():
        print('hello from', multiprocessing.current_process().name)
 
>>> process_hello()
hello from MainProcess
hello from Process-1

Shared state and synchronization

Problems can arise when multiple threads manipulate and access shared data. A race condition occurs when multiple threads access the same data and at least one mutates it. This causes unpredictable behavior. Access to shared data in the presence of mutation must be synchronized to prevent other threads from accessing the data while it is being mutated.[1]

For example, if x = 1 and we run the following lines in parallel in separate threads:

  • x = x * 2
  • x = x + 10

the possible final values of x are 12, 22, 11, and 2, depending on how the reads and writes interleave: 12 if the multiplication finishes before the addition starts, 22 if the addition finishes first, 11 if both threads read x = 1 and the multiplication's write is overwritten, and 2 if both read x = 1 and the addition's write is overwritten.
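Each assignment is really a read followed by a write, so the four outcomes above come from the possible orderings of those reads and writes. A small sketch that deterministically enumerates every interleaving (a simulation, not actual threading):

```python
from itertools import permutations

def final_values():
    """Enumerate all interleavings of two threads' reads and writes on x = 1."""
    results = set()
    ops = ['r1', 'w1', 'r2', 'w2']
    for order in permutations(ops):
        # keep only interleavings where each thread's read precedes its write
        if order.index('r1') < order.index('w1') and order.index('r2') < order.index('w2'):
            x = 1
            saved = {}
            for op in order:
                if op == 'r1':
                    saved[1] = x           # thread 1 reads x
                elif op == 'w1':
                    x = saved[1] * 2       # thread 1 writes x * 2
                elif op == 'r2':
                    saved[2] = x           # thread 2 reads x
                else:
                    x = saved[2] + 10      # thread 2 writes x + 10
            results.add(x)
    return results

print(sorted(final_values()))  # [2, 11, 12, 22]
```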

Techniques

Synchronized data structures

Some data structures provide synchronized operations. For example, a Queue provides synchronized first-in, first-out access to data: the put method adds an item to the Queue, and the get method retrieves the item that has been in the Queue longest. Both methods are synchronized, so items are not lost no matter how thread operations are interleaved.

An example of a producer/consumer Queue:

import threading
from queue import Queue
 
queue = Queue()
 
def synchronized_consume():
    while True:
        print('got an item:', queue.get())
        queue.task_done()
 
def synchronized_produce():
    consumer = threading.Thread(target=synchronized_consume, args=())
    consumer.daemon = True
    consumer.start()
    for i in range(10):
        queue.put(i)
    queue.join()
 
synchronized_produce()

Another use of a Queue is a parallel web crawler that searches for dead links on a website. Worker threads repeatedly get a URL from the Queue, check it, and add any newly discovered URLs hosted by the same site back to the Queue. Because the Queue is synchronized, multiple threads can safely add to and remove from the data structure concurrently.
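The crawler loop can be sketched with a toy, in-memory link graph standing in for a real website (the SITE dictionary and all URL names are hypothetical; None marks a dead link):

```python
import threading
from queue import Queue

# Toy link graph standing in for a real site; None marks a dead link.
SITE = {
    '/': ['/about', '/missing'],
    '/about': ['/'],
    '/missing': None,
}

url_queue = Queue()
seen = {'/'}
dead = []
lock = threading.Lock()  # protects seen and dead

def crawl():
    while True:
        url = url_queue.get()              # synchronized removal
        links = SITE.get(url)
        with lock:
            if links is None:
                dead.append(url)           # record the dead link
            else:
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        url_queue.put(link)  # synchronized addition
        url_queue.task_done()

for _ in range(2):
    threading.Thread(target=crawl, daemon=True).start()

url_queue.put('/')
url_queue.join()          # block until every queued URL has been processed
print(dead)               # ['/missing']
```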

Locks

A lock ensures that only one thread at a time can hold it. Once it is acquired, no other threads may acquire it until it is released.[1] For a lock to protect a set of variables, all code blocks that deal with those variables should be surrounded by acquire and release calls.[2]
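A minimal sketch of this pattern with Python's threading.Lock, protecting a shared counter (the counter, thread count, and iteration count are illustrative choices):

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:           # acquire before touching the shared variable
            counter += 1     # release happens automatically on block exit

threads = [threading.Thread(target=increment, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000 -- no updates are lost
```

Without the lock, the read-modify-write in counter += 1 could interleave between threads and lose updates, exactly as in the race condition example above.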

Barriers

Another way to avoid conflicting access to shared data is to divide a program into phases, ensuring that shared data is mutated in a phase in which no other thread accesses it. Barriers divide a program into phases by requiring all threads to reach it before any of them can proceed. Code that is executed after a barrier cannot be concurrent with code executed before the barrier.
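A short sketch using Python's threading.Barrier to separate a write phase from a read phase (the two-thread setup and the results/doubled lists are illustrative):

```python
import threading

barrier = threading.Barrier(2)
results = [0, 0]
totals = [0, 0]

def worker(i):
    results[i] = i + 1         # phase 1: each thread writes only its own slot
    barrier.wait()             # no thread proceeds until both reach the barrier
    totals[i] = sum(results)   # phase 2: all phase-1 writes are complete

threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(totals)  # [3, 3]
```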

Message passing

Another way is to entirely avoid concurrent access to the same data. Let computations behave independently, but give them a way to send messages to each other to coordinate.[2]
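A minimal sketch of message passing with multiprocessing.Pipe: the child process holds no shared state and communicates only by receiving and sending messages (the worker/ask names and the doubling computation are illustrative):

```python
import multiprocessing

def worker(conn):
    msg = conn.recv()       # block until a message arrives
    conn.send(msg * 2)      # reply; no shared state is involved
    conn.close()

def ask(value):
    """Send value to a child process and return its reply."""
    parent_conn, child_conn = multiprocessing.Pipe()
    child = multiprocessing.Process(target=worker, args=(child_conn,))
    child.start()
    parent_conn.send(value)
    reply = parent_conn.recv()
    child.join()
    return reply

if __name__ == '__main__':
    print(ask(21))  # 42
```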

Pitfalls

Incorrectly used synchronization methods can result in the following pitfalls.

Under-synchronization

Under-synchronization means neglecting to synchronize some accesses to shared data, leaving race conditions possible.

Over-synchronization

Over-synchronizing a program means that non-conflicting operations cannot occur concurrently, reducing the program's efficiency.

Deadlock

Deadlock is a situation in which two or more threads or processes are stuck, waiting for each other to finish. No thread can continue because each is waiting for another that is, in turn, waiting for it.
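The classic case is two threads acquiring two locks in opposite orders: each can end up holding one lock while waiting forever for the other. A common remedy, sketched below with illustrative names, is to make every thread acquire the locks in the same fixed order, so no circular wait can form:

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
log = []

# Deadlock risk: if one thread took lock_a then lock_b while another took
# lock_b then lock_a, each could hold one lock and wait forever for the
# other. Here both threads acquire in the same order, so that cannot happen.
def task(name):
    with lock_a:
        with lock_b:        # always a before b: no circular wait
            log.append(name)

threads = [threading.Thread(target=task, args=(n,)) for n in ('t1', 't2')]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(log))  # ['t1', 't2'] -- both tasks completed
```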

Sources

  1. http://inst.eecs.berkeley.edu/~cs61a/su13/slides/27-Parallelism_6pp.pdf
  2. http://inst.eecs.berkeley.edu/~cs61a/su13/disc/disc07b.pdf