{{C-class}}
'''Concurrency''', or ''parallel computing'', allows multiple operations to be performed simultaneously. A single CPU can only make it ''seem'' that programs are running concurrently: it rapidly switches between them, doing some work for each process before moving on to the next.
  
In the past, the speed of individual processor cores grew exponentially, but physical heat and energy limitations brought that growth to an end. CPU manufacturers have since turned to placing multiple cores in a single processor (multicore) to achieve more processing power.
  
==Parallelism in [[Python]]==
 
 
===Threading===
In ''threading'', multiple '''"threads"''' of execution exist within a single interpreter. Each thread executes code independently from the others, though they share the same data.
 
<syntaxhighlight lang="python">
>>> import threading
>>> def thread_hello():
        other = threading.Thread(target=thread_say_hello, args=())
        other.start()
        thread_say_hello()

>>> def thread_say_hello():
        print('hello from', threading.current_thread().name)

>>> thread_hello()
hello from Thread-1
hello from MainThread
</syntaxhighlight>
  
 
 
===Multiprocessing===
'''Multiprocessing''' allows a program to spawn multiple interpreters, or ''processes'', each of which runs independently without sharing data. Shared data must be communicated explicitly between processes.
 
<syntaxhighlight lang="python">
>>> import multiprocessing
>>> def process_hello():
        other = multiprocessing.Process(target=process_say_hello, args=())
        other.start()
        process_say_hello()

>>> def process_say_hello():
        print('hello from', multiprocessing.current_process().name)

>>> process_hello()
hello from MainProcess
hello from Process-1
</syntaxhighlight>
  
==Shared state and synchronization==
Problems can arise when multiple threads manipulate and access shared data. A ''race condition'' occurs when multiple threads access the same data and at least one mutates it. This causes unpredictable behavior. Access to shared data in the presence of mutation must be synchronized to prevent other threads from accessing the data while it is being mutated.<ref name="su13_lec">http://inst.eecs.berkeley.edu/~cs61a/su13/slides/27-Parallelism_6pp.pdf</ref>
For example, if <code>x = 1</code> and we run the following lines in parallel in separate threads:
* <code>x = x * 2</code>
* <code>x = x + 10</code>
the possible final values of <code>x</code> are 12, 11, 22, and 2: 12 if the doubling happens before the addition, 22 if the addition happens first, and 11 or 2 if both threads read <code>x = 1</code> before either writes, so that one update overwrites the other.
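As a minimal sketch, the lost update can be forced deliberately by separating each thread's read from its write with an artificial delay, so that both threads read the old value before either writes:

<syntaxhighlight lang="python">
import threading
import time

x = 1

def update(f):
    global x
    local = x          # read the shared value
    time.sleep(0.1)    # artificially widen the gap between read and write
    x = f(local)       # write back, overwriting the other thread's update

t1 = threading.Thread(target=update, args=(lambda v: v * 2,))
t2 = threading.Thread(target=update, args=(lambda v: v + 10,))
t1.start(); t2.start()
t1.join(); t2.join()
print(x)  # 2 or 11: both threads read x = 1, so one update is lost
</syntaxhighlight>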
===Techniques===
====Synchronized data structures====
Some data structures provide synchronized operations. For example, a <code>Queue</code> provides synchronized first-in, first-out access to data. The <code>put</code> method adds an item to the back of the <code>Queue</code> and the <code>get</code> method retrieves an item from the front. The methods are synchronized, so items are not lost no matter how thread operations are interleaved.
  
 
An example of a producer/consumer <code>Queue</code>:
 
 
<syntaxhighlight lang="python">
import threading
from queue import Queue

queue = Queue()

def synchronized_consume():
    while True:
        print('got an item:', queue.get())
        queue.task_done()

def synchronized_produce():
    consumer = threading.Thread(target=synchronized_consume, args=())
    consumer.daemon = True
    consumer.start()
    for i in range(10):
        queue.put(i)
    queue.join()

synchronized_produce()
</syntaxhighlight>
  
 
Another use of a <code>Queue</code> is a parallel web crawler that searches for dead links on a website. By following every link hosted by the same site, it continually adds newly discovered URLs to a <code>Queue</code>, while worker threads <code>get</code> URLs from the <code>Queue</code> and check them. By using a synchronized <code>Queue</code>, multiple threads can safely add to and remove from the data structure concurrently.
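A simplified sketch of such a crawler's core loop, assuming <code>urllib</code> for fetching (link extraction is elided, and the seed URL is a placeholder):

<syntaxhighlight lang="python">
import threading
from queue import Queue
from urllib.request import urlopen
from urllib.error import URLError

url_queue = Queue()

def check_links():
    while True:
        url = url_queue.get()   # safe even with many concurrent workers
        try:
            urlopen(url)        # try to fetch the page
        except URLError:
            print('dead link:', url)
        # a real crawler would parse the page here and put() new same-site URLs
        url_queue.task_done()

for _ in range(4):              # several worker threads share one Queue
    worker = threading.Thread(target=check_links)
    worker.daemon = True
    worker.start()

url_queue.put('http://example.com/')  # placeholder seed URL
url_queue.join()                # wait until every queued URL has been checked
</syntaxhighlight>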
 
====Locks====
A ''lock'' ensures that only one thread at a time can hold it. Once it is ''acquired'', no other threads may acquire it until it is ''released''.<ref name="su13_lec"/> For a lock to protect a set of variables, all code blocks that deal with those variables should be surrounded by <code>acquire</code> and <code>release</code> calls.<ref name="disc">http://inst.eecs.berkeley.edu/~cs61a/su13/disc/disc07b.pdf</ref>
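For instance, a minimal sketch that uses Python's <code>threading.Lock</code> to protect the shared name <code>x</code> from the race above (the <code>with</code> statement acquires the lock on entry and releases it on exit):

<syntaxhighlight lang="python">
import threading

x = 1
x_lock = threading.Lock()  # guards every access to x

def double():
    global x
    with x_lock:           # acquire; other threads block here until release
        x = x * 2

def add_ten():
    global x
    with x_lock:
        x = x + 10

t1 = threading.Thread(target=double)
t2 = threading.Thread(target=add_ten)
t1.start(); t2.start()
t1.join(); t2.join()
print(x)  # 12 or 22, depending on which thread acquires first; never 2 or 11
</syntaxhighlight>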
====Barriers====
Another way to avoid conflicting access to shared data is to divide a program into phases, ensuring that shared data is mutated in a phase in which no other thread accesses it. A ''barrier'' divides a program into phases by requiring all threads to reach it before any of them can proceed. Code executed after a barrier cannot be concurrent with code executed before the barrier.
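A minimal sketch using Python's <code>threading.Barrier</code>: in phase one each thread writes only its own slot, and no thread reads the others' results until all of them have passed the barrier:

<syntaxhighlight lang="python">
import threading

n = 3
results = [0] * n
barrier = threading.Barrier(n)  # all n threads must arrive before any proceeds

def worker(i):
    results[i] = i * i          # phase 1: mutate only this thread's slot
    barrier.wait()              # block until all n threads reach this point
    print('thread', i, 'sees', sum(results))  # phase 2: safe to read all slots

threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
for t in threads:
    t.start()
for t in threads:
    t.join()
</syntaxhighlight>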
====Message passing====
Another approach is to avoid concurrent access to the same data entirely: let computations behave independently, but give them a way to send messages to each other to coordinate.<ref name="disc"/>
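A minimal sketch of message passing between processes using <code>multiprocessing.Pipe</code> (no state is shared; values travel only through the pipe):

<syntaxhighlight lang="python">
import multiprocessing

def worker(conn):
    # the worker holds no shared state; it only receives and sends messages
    while True:
        n = conn.recv()
        if n is None:            # sentinel: no more work
            break
        conn.send(n * n)
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = multiprocessing.Pipe()
    p = multiprocessing.Process(target=worker, args=(child_conn,))
    p.start()
    for n in range(5):
        parent_conn.send(n)                    # message out
        print('squared:', parent_conn.recv())  # reply back
    parent_conn.send(None)                     # ask the worker to stop
    p.join()
</syntaxhighlight>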
===Pitfalls===
Incorrectly used synchronization methods can result in the following pitfalls.
====Under-synchronization====
Neglecting to properly synchronize shared accesses leaves race conditions in the program, producing incorrect or unpredictable results.
====Over-synchronization====
Over-synchronizing a program means that non-conflicting operations cannot occur concurrently, reducing the program's efficiency.
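For instance, a sketch in which one coarse lock needlessly serializes updates to two independent counters; giving each counter its own lock would let the updates proceed concurrently:

<syntaxhighlight lang="python">
import threading

hits = 0
misses = 0
big_lock = threading.Lock()  # one coarse lock guards both counters

def record_hit():
    global hits
    with big_lock:           # also blocks record_miss, even though the
        hits += 1            # two counters never interact

def record_miss():
    global misses
    with big_lock:
        misses += 1
</syntaxhighlight>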
====Deadlock====
Deadlock is a situation in which two or more threads or processes are stuck, waiting for each other to finish. No process can continue because it is waiting for other processes that are waiting for it to complete.
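As an illustration, a sketch of the classic two-lock deadlock: each thread grabs one lock and then waits forever for the other (the sleeps make the interleaving nearly certain, so the program hangs; acquiring the locks in a consistent order would prevent it):

<syntaxhighlight lang="python">
import threading
import time

lock_a = threading.Lock()
lock_b = threading.Lock()

def task1():
    with lock_a:         # task1 holds lock_a ...
        time.sleep(0.1)  # ... long enough for task2 to grab lock_b ...
        with lock_b:     # ... then blocks forever waiting for lock_b
            pass

def task2():
    with lock_b:         # task2 holds lock_b ...
        time.sleep(0.1)  # ... long enough for task1 to grab lock_a ...
        with lock_a:     # ... then blocks forever waiting for lock_a
            pass

t1 = threading.Thread(target=task1)
t2 = threading.Thread(target=task2)
t1.start(); t2.start()   # deadlock: neither thread can ever proceed
</syntaxhighlight>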
==Sources==
<references/>
* [http://composingprograms.com/pages/47-parallel-computing.html Parallel Computing]
