The University of Western Ontario
London, Canada
Department of Computer Science
CS4402/CS9635
Distributed and Parallel Systems
Course Web Site -- Winter 2019
Course Description
To best utilize parallel and distributed systems (multi-core,
many-core and cluster) is nowadays an essential task for computer
scientists.
This course studies the fundamental aspects of parallel systems and
aims at providing an integrated view of the various facets of software
development on such systems: hardware architectures, programming
languages and models, software development tools, software engineering
concepts and design patterns, performance modelling and analysis,
experimenting and measuring, application to scientific computing.
Course topics may include but are not limited to:
- multi-core, GPGPU, cluster, distributed and shared memory, hierarchical memory,
- multi-thread (Julia, CilkPlus, OpenMP), multi-process, message passing (Julia, MPI),
- scheduling (work-stealing scheduler) and synchronization,
- parallel algorithm design and multithreaded program analysis: parallelism, granularity, memory bandwidth issue, scalability, data locality and cache complexity,
- Intel Parallel Composer (Cilk Plus, ArBB, TBB), data race and performance analyzer (Cilkview, Vtune)
- NVIDIA's CUDA for GPU
- Message Passing Interface (MPI)
- high-performance software engineering techniques,
- fundamental applications: matrix multiplication, matrix transposition, FFT, reduction, scan, sort, etc.
Follow this link for various resources
(software tools and tutorials, hardware documentation,
conferences, other HPC course web sites, etc.)
regarding this course and HPC in general
(http://www.csd.uwo.ca/~moreno/HPC-Resources.html).
Prerequisites for Undergraduate Students
-
Students must be fluent in C or C++; they must also be familiar
with UNIX software tools (shell scripts, makefiles, debuggers).
Instructor
Name: | Marc Moreno Maza |
Office: | MC 327 |
Office Hours: | Tuesdays 13:30-15:20 in MC 327 |
Email: | moreno@csd.uwo.ca |
Phone: | 661-2111 x3741 |
Lecture Notes and Textbook
Notes of each lecture will be available on the course website, approximately one
or two days after the oral presentation.
There is no textbook.
Course and OWL Websites
The course web site is accessible here
For CS4402 and CS9635, the OWL web sites are respectively
here and
there.
Please check the site often for updates on lecture notes and errata.
Also be aware that
the course website is not a substitute for actual classroom attendance!
Course outline
Please find the course outline
here.
Lecture Topics
The list of topics will be something in the order of:
-
Overview of parallel computing
-
Hierarchical memories, cache complexity
-
The Cilk concurrency platform
-
Multithreading parallelism and performance
-
Scheduling and Synchronizing
-
Parallelism overheads
-
Parallel Random-Access Machines.
-
Many-core programming (GPGPUs)
-
Code optimization techniques for multi-cores and GPGPUs
-
High-Performance Computing with CUDA
-
Pipelining
-
Multiprocessed parallelism, message passing (MPI)
-
Other concurrency platforms (Julia, Open MP, Threading Building Blocks)
Relation of CS4402 to CS3101
CS3101 is a new course which has started in the 2012-2013 academic year. Its web site is http://www.csd.uwo.ca/~moreno/CS3101-1415.html This course is meant to provide the necessary theoretical background (architectures, models of computations, algorithms) to understand and practice high-performance computing.
CS3101 can be seen as an extension of other CS courses such as
-
3331A - Foundations of Computer Science I,
-
3305B - Operating Systems,
-
3340B - Analysis of Algorithms I,
-
3350B - Computer Architecture,
thus providing the parallel dimension
of Today's Computer Science.
Students who are taking CS4402 the academic year 2018-2019
are likely not to have taken CS3101 in previous years since it was not offered last year.
Therefore, for the academic year 2017-2018, CS4402 is a self-contained
course (as in previous years)
where a sufficient subset of this
theoretical background is presented.
Lecture Notes for 2018-2019
An introduction to parallel and distributed computing
slides (long version) and
slides (short version).
Cache memories: complexity analysis and practical issues.
slides and
C programs.
Analysis of Multithreaded Algorithms.
slides
The Fork-Join Model and its Implementation in Cilk
slides. Some examples of CilkPlus programs and a few more
here.
Parallel Random-Access Machines.
slides
Parallel Prefix Sum
slides
Many-core Computing with CUDA.
slides and
handouts. and
simple CUDA programs.
High-Performance Computing with CUDA.
slides and
handouts.
Lecture Notes from 2017-2018
An introduction to parallel and distributed computing
slides (long version) and
slides (short version).
Parallel and distributed computing with Julia
slides
The Fork-Join Model and its Implementation in Cilk
slides. Some examples of CilkPlus programs and a few more
here.
Analysis of Multithreaded Algorithms.
slides
Cache memories: complexity analysis and practical issues.
slides and
C programs.
Parallel Random-Access Machines.
slides
Parallel Prefix Sum
slides
Many-core Computing with CUDA.
slides and
handouts. and
simple CUDA programs.
High-Performance Computing with CUDA.
slides and
handouts.
Lecture Notes from 2016-2017
An introduction to parallel and distributed computing
slides (long version) and
slides (short version).
Parallel and distributed computing with Julia
slides
The Fork-Join Model and its Implementation in Cilk
slides. Some examples of CilkPlus programs and a few more
here.
Analysis of Multithreaded Algorithms.
slides
Cache memories: complexity analysis and practical issues.
slides and
handouts and
C programs.
Parallel Prefix Sum
slides
MetaFork: A Compilation Framework for
Concurrency Platforms Targeting Multicores
(by Xiaohui Chen).
slides
Many-core Computing with CUDA.
slides and
handouts. and
simple CUDA programs.
High-Performance Computing with CUDA.
slides and
handouts.
Parallel Random-Access Machines.
slides
A Many-core Machine Model for Designing Algorithms with Minimum
Parallelism Overheads.
Synchronizing without locks.
slides and
handouts.
Issues in Parallelism (by Matteo Frigo).
slides
Problem Sets for 2018-2019
Problem set 1
Problem set 2
Readings for Problem Set 1
All-Pair Shortest Paths: course notes.
All Pair Shortest Paths and Matrix Multiplication: book chapter
Parallel Algorithms for the All-Sources Generalized Shortest Paths Problem: article by Jeffrey D. Oldham and Vaughan Pratt.
Quiz corrections from 2014-2015
Quiz 1: elements of corrections for a 2014 quiz.
A CUDA quiz and its
elements of correction.
2015 Quiz 2 with elements of corrections.
Quiz corrections from 2016-2017
Quiz 1 with elements of corrections.
Quiz 2 with elements of corrections.
Class Schedule
Lectures: 3 hours (Tuesdays 15:30 - 16:20 in MC 320 and Wednesdays16:30 - 18:20 in MC-320.
Each student is expected to attend the lectures. In particular, quizzes (short written tests) may take place without notice.
CS4402 projects in 2018-2019
Please note that the papers in the 2017-2018 can also be considered,
for instance: A Performance Analysis Framework for Identifying
Potential Benefits in GPGPU Applications
paper
(chosen)
- Fixing Performance Bugs: An Empirical Study of Open-Source GPGPU Programs
paper (chosen)
- A Comparative Study and Evaluation of Parallel
Programming Models for Shared-Memory Parallel
Architectures paper (chosen)
- Register Spilling and Live-Range Splitting for SSA-Form Programs paper (chosen)
- Beyond nested parallelism: tight bounds on work-stealing overheads for parallel futures paper
- Expressing pipeline parallelism using TBB constructs: a case study on what works and what doesn't paper
- On-the-Fly Pipeline Parallelism paper (chosen)
- Pipelining with futures paper (chosen)
- Tapir: Embedding Fork-Join Parallelism into LLVM's Intermediate Representation paper
- AUTOGEN: automatic discovery of cache-oblivious parallel recursive algorithms for solving dynamic programs paper (chosen)
- GPU-ABiSort: optimal parallel sorting on stream architectures
paper (chosen)
- Arbitrary-Precision Arithmetics on the GPU
paper (chosen)
- Optimizing Graph Algorithms for Improved
Cache Performance paper (considered)
- An efficient parallelization strategy for dynamic programming on GPU paper (considered)
- Efficient Processing of Large Data Structures on GPUs:
Enumeration Scheme Based Optimisation paper (chosen)
CS4402 projects in 2017-2018
- Automatic Restructuring of GPU Kernels
for Exploiting Inter-thread Data Locality
paper (chosen)
- A Performance Analysis Framework for Identifying
Potential Benefits in GPGPU Applications
paper (chosen)
- Memory-level and Thread-level Parallelism Aware GPU Architecture Performance Analytical Model
paper
- Optimal Loop Unrolling For GPGPU Programs
paper
- In-Place Matrix Transposition on GPUs
paper (chosen)
-
Dynamic Programming Parallel Implementation for the Knapsack problem
paper (chosen)
-
Parallelizing MiniSat
report
- Sorting algorithms in CUDA (literaure review and comparative
experimentation) (chosen)
paper
- prefix sum algorithms in CUDA (literaure review and comparative experimentation)
wikipedia
and
one paper to start with
(chosen)
Computing Facilities
Each student will be given an account on the Computer Science Department senior
undergraduate computing facility, GAUL. In accepting the GAUL account, a student agrees
to abide by the department's
Note: After-hours access to certain Computer Science lab rooms is by student card. If a
student card is lost, a replacement card will no longer open these lab rooms, and the
student must bring the new card to a member of the Systems Group in Middlesex College
Room 346.
Marc Moreno Maza
Last modified: Mon Jan 10 EDT 2017