The University of Western Ontario
London, Canada

Department of Computer Science

CS4402/CS9635
Distributed and Parallel Systems

Course Web Site -- Winter 2019

Course Description

To best utilize parallel and distributed systems (multi-core, many-core and cluster) is nowadays an essential task for computer scientists.

This course studies the fundamental aspects of parallel systems and aims at providing an integrated view of the various facets of software development on such systems: hardware architectures, programming languages and models, software development tools, software engineering concepts and design patterns, performance modelling and analysis, experimenting and measuring, application to scientific computing.

Course topics may include but are not limited to:

multi-core, GPGPU, cluster, distributed and shared memory, hierarchical memory,
multi-thread (Julia, CilkPlus, OpenMP), multi-process, message passing (Julia, MPI),
scheduling (work-stealing scheduler) and synchronization,
parallel algorithm design and multithreaded program analysis: parallelism, granularity, memory bandwidth issue, scalability, data locality and cache complexity,
Intel Parallel Composer (Cilk Plus, ArBB, TBB), data race and performance analyzer (Cilkview, Vtune)
NVIDIA's CUDA for GPU
Message Passing Interface (MPI)
high-performance software engineering techniques,
fundamental applications: matrix multiplication, matrix transposition, FFT, reduction, scan, sort, etc.

Follow this link for various resources (software tools and tutorials, hardware documentation, conferences, other HPC course web sites, etc.) regarding this course and HPC in general (http://www.csd.uwo.ca/~moreno/HPC-Resources.html).

Prerequisites for Undergraduate Students

Students must be fluent in C or C++; they must also be familiar with UNIX software tools (shell scripts, makefiles, debuggers).

Instructor

Name:	Marc Moreno Maza
Office:	MC 327
Office Hours:	Tuesdays 13:30-15:20 in MC 327
Email:	moreno@csd.uwo.ca
Phone:	661-2111 x3741

Lecture Notes and Textbook

Notes of each lecture will be available on the course website, approximately one or two days after the oral presentation. There is no textbook.

Course and OWL Websites

The course web site is accessible here For CS4402 and CS9635, the OWL web sites are respectively here and there.

Please check the site often for updates on lecture notes and errata. Also be aware that the course website is not a substitute for actual classroom attendance!

Course outline

Please find the course outline here.

Lecture Topics

The list of topics will be something in the order of:

Overview of parallel computing
Hierarchical memories, cache complexity
The Cilk concurrency platform
Multithreading parallelism and performance
Scheduling and Synchronizing
Parallelism overheads
Parallel Random-Access Machines.
Many-core programming (GPGPUs)
Code optimization techniques for multi-cores and GPGPUs
High-Performance Computing with CUDA
Pipelining
Multiprocessed parallelism, message passing (MPI)
Other concurrency platforms (Julia, Open MP, Threading Building Blocks)

Relation of CS4402 to CS3101

CS3101 is a new course which has started in the 2012-2013 academic year. Its web site is http://www.csd.uwo.ca/~moreno/CS3101-1415.html This course is meant to provide the necessary theoretical background (architectures, models of computations, algorithms) to understand and practice high-performance computing.

CS3101 can be seen as an extension of other CS courses such as

3331A - Foundations of Computer Science I,
3305B - Operating Systems,
3340B - Analysis of Algorithms I,
3350B - Computer Architecture,

thus providing the parallel dimension of Today's Computer Science.

Students who are taking CS4402 the academic year 2018-2019 are likely not to have taken CS3101 in previous years since it was not offered last year. Therefore, for the academic year 2017-2018, CS4402 is a self-contained course (as in previous years) where a sufficient subset of this theoretical background is presented.

Lecture Notes for 2018-2019

An introduction to parallel and distributed computing slides (long version) and slides (short version).

Cache memories: complexity analysis and practical issues. slides and C programs.

Analysis of Multithreaded Algorithms. slides

The Fork-Join Model and its Implementation in Cilk slides. Some examples of CilkPlus programs and a few more here.

Parallel Random-Access Machines. slides

Parallel Prefix Sum slides

Many-core Computing with CUDA. slides and handouts. and simple CUDA programs.

High-Performance Computing with CUDA. slides and handouts.

Lecture Notes from 2017-2018

An introduction to parallel and distributed computing slides (long version) and slides (short version).

Parallel and distributed computing with Julia slides

The Fork-Join Model and its Implementation in Cilk slides. Some examples of CilkPlus programs and a few more here.

Analysis of Multithreaded Algorithms. slides

Cache memories: complexity analysis and practical issues. slides and C programs.

Parallel Random-Access Machines. slides

Parallel Prefix Sum slides

Many-core Computing with CUDA. slides and handouts. and simple CUDA programs.

High-Performance Computing with CUDA. slides and handouts.

Lecture Notes from 2016-2017

An introduction to parallel and distributed computing slides (long version) and slides (short version).

Parallel and distributed computing with Julia slides

The Fork-Join Model and its Implementation in Cilk slides. Some examples of CilkPlus programs and a few more here.

Analysis of Multithreaded Algorithms. slides

Cache memories: complexity analysis and practical issues. slides and handouts and C programs.

Parallel Prefix Sum slides

MetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores (by Xiaohui Chen). slides

Many-core Computing with CUDA. slides and handouts. and simple CUDA programs.

High-Performance Computing with CUDA. slides and handouts.

Parallel Random-Access Machines. slides

A Many-core Machine Model for Designing Algorithms with Minimum Parallelism Overheads.

Synchronizing without locks. slides and handouts.

Issues in Parallelism (by Matteo Frigo). slides

Class Schedule

Lectures: 3 hours (Tuesdays 15:30 - 16:20 in MC 320 and Wednesdays16:30 - 18:20 in MC-320.

Each student is expected to attend the lectures. In particular, quizzes (short written tests) may take place without notice.

CS4402 projects in 2018-2019

Please note that the papers in the 2017-2018 can also be considered, for instance: A Performance Analysis Framework for Identifying Potential Benefits in GPGPU Applications paper

(chosen)

Fixing Performance Bugs: An Empirical Study of Open-Source GPGPU Programs paper (chosen)
A Comparative Study and Evaluation of Parallel Programming Models for Shared-Memory Parallel Architectures paper (chosen)
Register Spilling and Live-Range Splitting for SSA-Form Programs paper (chosen)
Beyond nested parallelism: tight bounds on work-stealing overheads for parallel futures paper
Expressing pipeline parallelism using TBB constructs: a case study on what works and what doesn't paper
On-the-Fly Pipeline Parallelism paper (chosen)
Pipelining with futures paper (chosen)
Tapir: Embedding Fork-Join Parallelism into LLVM's Intermediate Representation paper
AUTOGEN: automatic discovery of cache-oblivious parallel recursive algorithms for solving dynamic programs paper (chosen)
GPU-ABiSort: optimal parallel sorting on stream architectures paper (chosen)
Arbitrary-Precision Arithmetics on the GPU paper (chosen)
Optimizing Graph Algorithms for Improved Cache Performance paper (considered)
An efficient parallelization strategy for dynamic programming on GPU paper (considered)
Efficient Processing of Large Data Structures on GPUs: Enumeration Scheme Based Optimisation paper (chosen)

CS4402 projects in 2017-2018

Automatic Restructuring of GPU Kernels for Exploiting Inter-thread Data Locality paper (chosen)
A Performance Analysis Framework for Identifying Potential Benefits in GPGPU Applications paper (chosen)
Memory-level and Thread-level Parallelism Aware GPU Architecture Performance Analytical Model paper
Optimal Loop Unrolling For GPGPU Programs paper
In-Place Matrix Transposition on GPUs paper (chosen)
Dynamic Programming Parallel Implementation for the Knapsack problem paper (chosen)
Parallelizing MiniSat report
Sorting algorithms in CUDA (literaure review and comparative experimentation) (chosen) paper
prefix sum algorithms in CUDA (literaure review and comparative experimentation) wikipedia and one paper to start with (chosen)

Computing Facilities

Each student will be given an account on the Computer Science Department senior undergraduate computing facility, GAUL. In accepting the GAUL account, a student agrees to abide by the department's

Note: After-hours access to certain Computer Science lab rooms is by student card. If a student card is lost, a replacement card will no longer open these lab rooms, and the student must bring the new card to a member of the Systems Group in Middlesex College Room 346.

Marc Moreno Maza

Last modified: Mon Jan 10 EDT 2017

Department of Computer Science

CS4402/CS9635 Distributed and Parallel Systems

Course Web Site -- Winter 2019

Course Description

Prerequisites for Undergraduate Students

Instructor

Lecture Notes and Textbook

Course and OWL Websites

Course outline

Lecture Topics

Relation of CS4402 to CS3101

Lecture Notes for 2018-2019

Lecture Notes from 2017-2018

Lecture Notes from 2016-2017

Problem Sets for 2018-2019

Readings for Problem Set 1

Quiz corrections from 2014-2015

Quiz corrections from 2016-2017

Class Schedule

CS4402 projects in 2018-2019

CS4402 projects in 2017-2018

Computing Facilities

CS4402/CS9635
Distributed and Parallel Systems