Presenters

1A Bioinformatics
Nilesh Khiste

HISEA - Hierarchical Seed Aligner for PacBio Data

An essential step in assembling PacBio sequencing data is the detection of alignments/overlaps between all pairs of reads. The high error rate and very long reads make this a greater challenge than for Illumina sequencing data. I will present a high-level overview of the genome alignment process and then discuss a new aligner algorithm, HISEA (HIerarchical SEed Aligner), for PacBio sequencing data. HISEA has the best alignment-detection sensitivity among aligners for PacBio sequencing data and, within the Canu assembly pipeline, produces better assemblies than Canu's default aligner.
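
As a rough illustration of the overlap-detection step that all long-read aligners share (this is generic seed matching, not HISEA's hierarchical seeding strategy), a minimal Python sketch:

```python
# Illustrative sketch (not HISEA's actual algorithm): naive k-mer
# seed matching, the starting point of most long-read overlappers.
from collections import defaultdict

def kmer_index(read, k):
    """Map each k-mer of `read` to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(read) - k + 1):
        index[read[i:i + k]].append(i)
    return index

def seed_matches(read_a, read_b, k=16):
    """Return (pos_in_a, pos_in_b) pairs of exact shared k-mers.

    Real aligners then chain colinear seeds and extend them into
    alignments that tolerate PacBio's high error rate.
    """
    index = kmer_index(read_a, k)
    matches = []
    for j in range(len(read_b) - k + 1):
        for i in index.get(read_b[j:j + k], ()):
            matches.append((i, j))
    return matches

print(seed_matches("ACGTACGTACGTACGTAC", "TTACGTACGTACGTACGT", k=8))
```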

Qin Dong

Improved whole-genome oligonucleotide design

Oligonucleotides are short, single-stranded fragments of DNA or RNA, designed to readily bind to a unique part of the target sequence. They have many important applications, including PCR (polymerase chain reaction) amplification, microarrays, and FISH (fluorescence in situ hybridization) probes.
While traditional microarrays are commonly used to measure gene expression levels by probing for sequences of known and predicted genes, high-density, whole-genome tiling arrays probe intensively for sequences known to exist in a contiguous region. Current programs for designing tiling-array oligonucleotides produce results far from optimal: they allow oligonucleotides that are too similar to non-target sequences, enabling unwanted cross-hybridization.
Our tests show that up to 70% of the oligonucleotides they produce are unsuitable. We present a new program, BOND-tile, that produces much better tiling arrays, as shown by extensive comparison with leading programs. All oligonucleotides produced by BOND-tile are of high quality, and it runs two orders of magnitude faster, so that designing oligonucleotides for a whole-human-genome tiling array is no longer an insurmountable problem.
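
As a toy illustration of the cross-hybridization concern (BOND-tile's actual quality criteria are more sophisticated; the shared-k-mer proxy and thresholds below are invented for the example):

```python
# Minimal sketch of a cross-hybridization screen (not BOND-tile's
# actual method): reject a candidate oligo if it shares too many
# k-mers with any non-target sequence.
def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def is_specific(oligo, non_targets, k=12, max_shared=5):
    """Accept `oligo` only if its k-mer overlap with every non-target
    stays below `max_shared` (a crude proxy for hybridization energy)."""
    probe = kmers(oligo, k)
    return all(len(probe & kmers(seq, k)) < max_shared for seq in non_targets)

print(is_specific("ACGTTGCAACGGTTAACCGGTTAA",
                  ["GGGCCCGGGCCCGGGCCCGGGCCC"], k=8))
```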

Yi Liu

Protein Disulfide Bond Characterization from Tandem Mass Spectra

In this research, we present a new computational approach that matches an input MS/MS spectrum against putative disulfide linkage structures hypothetically constructed from a protein database. More specifically, we introduce an effective two-layer scoring system to evaluate the significance of the match between spectrum and structure, on which we also build a target-decoy strategy for quality control and for reporting the false discovery rate of the final results. Systematic experiments on both low-complexity and high-complexity datasets demonstrate the effectiveness of the proposed method for identifying disulfide bonds from MS/MS spectra, and its potential for characterizing disulfide bonds at the proteome scale rather than for a single protein.
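
The target-decoy idea itself is standard; a minimal sketch of the usual FDR estimate at a score threshold, assuming decoy hits model false target hits:

```python
# Sketch of the standard target-decoy FDR estimate: at a given score
# threshold, the decoy hit count estimates the number of false target
# hits, so FDR ~ decoys / targets above the threshold.
def fdr_at_threshold(target_scores, decoy_scores, threshold):
    t = sum(s >= threshold for s in target_scores)
    d = sum(s >= threshold for s in decoy_scores)
    return d / t if t else 0.0

targets = [9.1, 8.7, 7.2, 6.5, 3.1, 2.8]
decoys  = [4.0, 3.3, 2.9, 1.5, 1.2, 0.7]
print(fdr_at_threshold(targets, decoys, threshold=3.0))  # 2/5 = 0.4
```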

Yiwei Li

SPRINT: Ultrafast protein-protein interaction prediction of the entire human interactome

Protein-protein interactions (PPIs) are vital processes in molecular biology. However, the current understanding of PPIs is far from satisfactory, and improved methods of predicting them are much needed. Since experimental methods are labour- and time-consuming and lack accuracy, the improvement is expected to come from computational methods. We present SPRINT, a new PPI prediction algorithm and program. It surpasses other state-of-the-art programs in both accuracy and speed.

Zhewei Liang

NBPMF: Novel Network-Based Inference Methods for Peptide Mass Fingerprinting

Mass spectrometry (MS) has become a primary tool for protein identification and quantification. Peptide mass fingerprinting (PMF) is widely used to identify proteins from MS data. Conventional PMF methods, such as the probabilistic MOWSE algorithm, are based on the mass distribution of tryptic peptides. In this paper we develop novel network-based inference software termed NBPMF. By analyzing the peptide-protein bipartite network, we design new peptide-protein matching score functions. We present two methods: a static one, ProbS, based on an independent-probability framework, and a dynamic one, HeatS, which models the input data from a dependent point of view. We also use linear regression to adjust the matching score according to protein masses, and we exploit the order of retention times to further correct the score function. In post-processing, we require that each peak be assigned to at most one peptide, reducing random matches. Experiments on simulated and real data demonstrate that the NBPMF approaches significantly outperform several state-of-the-art methods.
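
As a hedged illustration of a ProbS-style score on the bipartite network (the paper's actual score functions are more elaborate; the toy database below is invented):

```python
# Hedged sketch of a ProbS-style score on the peptide-protein bipartite
# graph: each matched peptide distributes one unit of "resource"
# equally among the proteins that contain it, so shared peptides
# contribute weaker evidence.
from collections import defaultdict

# peptide -> proteins containing it (toy database)
peptide_to_proteins = {
    "PEPTIDEA": ["P1"],
    "PEPTIDEB": ["P1", "P2"],   # shared peptide: weaker evidence
    "PEPTIDEC": ["P2"],
}
matched_peptides = ["PEPTIDEA", "PEPTIDEB"]  # peptides explaining peaks

scores = defaultdict(float)
for pep in matched_peptides:
    owners = peptide_to_proteins[pep]
    for prot in owners:
        scores[prot] += 1.0 / len(owners)

print(dict(scores))  # {'P1': 1.5, 'P2': 0.5}
```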

2A Computer Algebra & Theory of Computer Science
Daniel Page

Approximation Algorithms for Scheduling Unrelated Parallel Machines

Consider the following parallel machine scheduling problem. Let m and n be the number of parallel machines and jobs, respectively, and suppose that job j takes p_{i,j} time units when scheduled on machine i. Given n jobs, the goal is to schedule each job on a machine non-preemptively so that the schedule containing all n jobs has minimum makespan. This is a classic NP-hard problem in scheduling theory called the makespan minimization problem on unrelated parallel machines, commonly abbreviated R||C_{max}. The best approximation algorithms currently known for R||C_{max} have approximation ratio 2. Despite over 25 years of intensive study, finding an approximation algorithm with ratio better than 2 for R||C_{max} remains an open problem. In response, researchers have focused on NP-hard special cases of R||C_{max} that share the same inapproximability bounds, in the hope of better understanding the problem and providing approximation algorithms with improved ratios. Two such problems are the graph balancing problem and the restricted assignment problem. In this talk we discuss recent developments in our research on these two problems, with a focus on the case where the parallel machines are uniform, i.e., every parallel machine has a speed.
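
In symbols, using the standard integer-programming formulation with indicator variables $x_{i,j}$ (job $j$ assigned to machine $i$):

```latex
% Standard integer-programming formulation of R||C_max:
% x_{i,j} = 1 iff job j is scheduled on machine i.
\begin{align*}
\text{minimize}   \quad & C_{\max} \\
\text{subject to} \quad & \sum_{j=1}^{n} p_{i,j}\, x_{i,j} \le C_{\max}
                          && i = 1, \dots, m \\
                  & \sum_{i=1}^{m} x_{i,j} = 1 && j = 1, \dots, n \\
                  & x_{i,j} \in \{0, 1\}.
\end{align*}
```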

Ethan Jackson

Applied Algebraic Methods for Neural Networks

Despite significant effort, there is currently no formal or de facto standard framework or format for sharing general neural network models between fields or even between systems. In computational neuroscience, there have been some attempts to formalize connectionist notations and generative operations for neural networks, including Connection Set Algebra. In machine learning, the use of linear algebra and tensor-based models is widespread. Despite the common elements among the tools of these fields, there exists no common mathematical framework for representing, manipulating, and evaluating general neural networks. To address this, we sought simple yet rigorously defined generalizations of existing tools without reinventing existing mathematical machinery. As a first step towards a comprehensive generalization, we present a simple framework based on linear and relation algebras. We demonstrate the usefulness of this approach by defining new operations for network composition, along with proofs of their most important properties. We then use these operations to build a symbolic model for an artificial neural network that can target two different experimental environments: HyperNEAT’s Agent Simulator and Deepmind Lab. The result is an algebraic framework for neural network construction and description that generalizes network description for at least two systems. This work also serves as a model approach for further generalization efforts and their applications.
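
As a toy sketch of the relation-algebraic flavour (not the talk's actual framework; the matrices are invented), sequential composition of connectivity relations as a boolean matrix product:

```python
# Toy sketch: sequential composition of two layer connectivity
# relations as a boolean matrix product, one of the simplest
# relation-algebraic operations on networks.
import numpy as np

# R[i, j] = True iff unit i in layer 0 feeds unit j in layer 1.
R = np.array([[1, 0, 1],
              [0, 1, 0]], dtype=bool)
# S[j, k] = True iff unit j in layer 1 feeds unit k in layer 2.
S = np.array([[1, 0],
              [0, 1],
              [0, 1]], dtype=bool)

# Relational composition R;S: a path exists from i to k through some j.
composed = (R.astype(int) @ S.astype(int)) > 0
print(composed)
```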

Nasim Samei

Analysis of a Local Search Algorithm for the $k$-Facility Location Problem

In the $k$-facility location problem we are given the possible locations for a group of at most $k$ facilities that are to provide some service to a set of clients. Each location has an associated cost for building a facility there. The goal is to select the locations for the facilities that minimize the sum of the cost of building the facilities and the total cost of servicing the clients. In this paper we analyse a local-search heuristic with multiple swaps for the metric $k$-facility location problem and prove that it has a locality gap of $2+ \sqrt{3}+ \epsilon$ for any constant $\epsilon>0$. This matches the bound obtained by Zhang \cite{Zhang} for a local search algorithm that uses insertions and deletions in addition to swaps. We also prove a second, tight bound for the locality gap of our algorithm which is better than the above one in many cases. For example, when the ratio between the highest and lowest facility cost is bounded by $p+1$, where $p$ is the maximum number of facilities that can be exchanged in a swap operation, the locality gap of our algorithm is $3+ \frac{2}{p}$; this matches the locality gap of the algorithm of Arya et al. for the $k$-median problem.
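
For intuition, a minimal single-swap ($p = 1$) local-search sketch in Python; the analysed algorithm permits up to $p$ simultaneous swaps, and the toy costs below are invented:

```python
# Minimal sketch of single-swap local search for k-facility location
# (the analysed algorithm allows up to p simultaneous swaps).
def total_cost(open_set, fcost, dist):
    service = sum(min(dist[c][f] for f in open_set) for c in range(len(dist)))
    return sum(fcost[f] for f in open_set) + service

def local_search(k, fcost, dist):
    open_set = set(range(k))            # arbitrary initial solution
    improved = True
    while improved:
        improved = False
        for f_out in list(open_set):
            for f_in in set(range(len(fcost))) - open_set:
                cand = (open_set - {f_out}) | {f_in}
                if total_cost(cand, fcost, dist) < total_cost(open_set, fcost, dist):
                    open_set, improved = cand, True
                    break               # restart the scan after a swap
            if improved:
                break
    return open_set

fcost = [5, 1, 1]                        # facility opening costs
dist = [[0, 4, 9], [1, 0, 2], [8, 3, 0]] # dist[client][facility]
print(local_search(2, fcost, dist))      # -> {1, 2}
```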

Linxiao Wang

Asymptotically fast algorithms for Sub-/Inter-Group Key Distribution in ad hoc networks

People need to communicate with each other in many emerging networks, e.g., ad hoc networks. To ensure the security of group communication, group key management, a fundamental cryptographic primitive, has been proposed. We use an approach based on polynomial arithmetic to construct a group key management scheme. Our protocols do not require interaction between users. Storage and computation analyses show that our approach is secure and efficient compared with existing schemes. By using asymptotically fast algorithms based on fast Fourier transforms (FFTs), we improve the efficiency of the computations with respect to our previous work.
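
The scheme's details are in the talk; its asymptotically fast primitive is FFT-based polynomial multiplication, sketched here with floating-point FFTs (a real implementation would work modulo a prime):

```python
# Sketch of the fast primitive: multiplying two polynomials in
# O(n log n) via FFT (numpy), versus O(n^2) for schoolbook.
import numpy as np

def poly_mul_fft(a, b):
    """Coefficient lists (lowest degree first) -> product coefficients."""
    n = len(a) + len(b) - 1
    size = 1 << (n - 1).bit_length()          # next power of two
    fa = np.fft.rfft(a, size)
    fb = np.fft.rfft(b, size)
    prod = np.fft.irfft(fa * fb, size)[:n]
    return np.rint(prod).astype(int)          # exact for small ints

print(poly_mul_fft([1, 2, 3], [4, 5]))  # (1+2x+3x^2)(4+5x) = 4+13x+22x^2+15x^3
```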

3A HCI, Distributed Systems & Software Engineering
Roopa Bose

Online Graphics for the Blind: Intermediate Format Generation for Graphic Categories

Visual representations like pictures, paintings and diagrams are not accessible to visually impaired people because of the large amount of visual content they carry. Textual information can be made accessible quite easily with text-to-speech technology, but such technology does little for visual representations. Tactile graphics were introduced to make visual representations accessible via touch. However, providing access through touch is challenging because tactile perception is slow, small objects are hard to discriminate in tactile graphics, and tactile resolution is low compared to visual resolution.
Many existing solutions provide access to diagrams, maps, photographs, web pages, etc. However, most of them produce tactile graphics primarily for one dedicated output device. In this talk, we describe a general framework for generating tactile and multimodal graphics from various sources. The key module, which generates the intermediate format for a specific graphic category, is explained in detail.

Sheikh Abdullah

A survey of data mining methods, interactive visualization techniques, and electronic alert tools to detect episodes of kidney injury from Electronic Health Records: A proposed visual analytic model to predict, manage and monitor kidney diseases

Kidney injury refers to the deterioration of kidney function; it affects over 10% of inpatients and about 25 to 55% of patients in the intensive care unit (ICU). In this presentation, I will identify, categorize and analyze tools and studies on electronic health records (EHRs), highlighting the data mining, interactive visualization, and clinical alert studies that deal with acute kidney injury and chronic kidney disease. The shortcomings of existing tools will be identified with respect to cognitive load distribution and their ability to support complex cognitive activities. Before discussing the tools, I will provide an overview of the relevant concepts, including kidney injury, risk factors, types of kidney disease, types of EHRs, and the challenges EHRs pose. Finally, a brief overview of a real-time visual analytics tool will be presented.

Neda Rostamzadeh

A Survey of Interactive Visualization Tools to Support Electronic Health Records (EHR): A Proposed Visual Analytics Tool to Identify, Manage and Monitor Acute Kidney Injury Using EHR

Acute kidney injury (AKI) is the most common clinical event in hospitalized patients, affecting approximately 2 to 9% of patients admitted to hospitals. Early detection and identification of AKI is therefore crucial for devising management strategies that decrease the length and severity of the disease and improve patient outcomes. In this study, we propose to design and implement a visual analytics tool that helps healthcare providers detect, manage and monitor patients with AKI. We identify the limitations and shortcomings of existing AKI surveillance systems and discuss how the proposed tool can support the cognitive activities of healthcare providers by combining the strengths of interactive visualization and data analysis tools. Finally, we discuss the significance of information visualization in the exploration of EHRs, how it can address the challenges of the large and heterogeneous data stored in EHRs, and give an overview of a number of existing interactive visualization tools developed for EHRs.

Motahera Shermin

Secure Interoperability in Cloud

Cloud computing has been on the rise since its inception, and its functional and economic benefits are substantial. Yet the cloud computing model has not been adopted to its full potential: the lack of transparency and control over security and governance remains a prominent concern. These concerns become more of an obstacle when interoperability is applied in practice in a heterogeneous or multi-cloud environment, and more complicated still when dynamic service discovery enters the equation. The security control frameworks currently in practice for cloud providers often do not meet expectations. In this discussion, we highlight practical use cases for interoperable cloud architecture, along with the necessity of leveraging service level agreements for dynamic service discovery hosted on different cloud service providers. We then discuss our research direction towards an architectural solution for the seamless adoption of interoperability through a unified model.

Thiago Coelho Prado

Autonomic Data Center Management using Multi-Agent Systems

Managing a data center is a challenging task due to its high complexity. One aspect of this activity is how to handle power consumption efficiently while meeting the constraints of Service Level Agreements (SLAs). Among the many approaches to this problem, one is to build on the self-management capabilities of a data center given a set of policies; this is often referred to as Autonomic Computing.
The properties described by the Autonomic Computing model can be compared to one of the many definitions of software agents. Building on Autonomic Computing, we investigate approaches based on multi-agent systems to manage a data center and the trade-off between energy efficiency and SLAs. We evaluate multi-agent behaviour in a simulated environment where different configurations and algorithms can be studied.
With this work, we expect to improve the workflow between validation and deployment of multi-agent systems, aiming to improve energy efficiency in data centers.

1B Data Mining, Machine Learning & AI
Annette Megerdichian Azad

Mining of Primary Healthcare Patient Data with Selective Multimorbid Diseases

Despite a large volume of research on the prognosis, diagnosis and overall burden of multimorbidity, very little is known about the socio-demographic characteristics of multimorbid patients. This thesis analyzes the socio-demographic characteristics of patients with multiple chronic conditions (multimorbidity), focusing on patient groups sharing the same combination of diseases. Several methods are explored to analyze the co-occurrence of multiple chronic diseases and the correlations between socio-demographics and chronic conditions: disease-pair distributions over gender, age groups and income quintiles; Multimorbidity Coefficients measuring the concurrence of disease pairs and triples; and k-modes clustering to examine the demographics of patients with the same chronic conditions. The experiments suggest that patient income quintile does not affect multimorbidity rates, although gender and age group may play an important role in the prevalence of multimorbidity and the diagnosis of certain disease combinations.
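
The thesis's Multimorbidity Coefficient is not defined in this abstract; one plausible concurrence measure, an observed-to-expected co-occurrence ratio, is sketched below for illustration only:

```python
# Hedged sketch: an observed-vs-expected co-occurrence ratio for a
# disease pair (lift). The thesis's Multimorbidity Coefficient may be
# defined differently; this only illustrates the idea of measuring
# concurrence against chance.
def cooccurrence_ratio(patients, d1, d2):
    n = len(patients)
    p1 = sum(d1 in p for p in patients) / n
    p2 = sum(d2 in p for p in patients) / n
    p12 = sum(d1 in p and d2 in p for p in patients) / n
    return p12 / (p1 * p2)  # > 1: the pair co-occurs more than chance

patients = [{"diabetes", "hypertension"}, {"diabetes"},
            {"hypertension", "asthma"}, {"diabetes", "hypertension"}]
print(cooccurrence_ratio(patients, "diabetes", "hypertension"))
```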

Dmitrii Marin

Breiman's bias: from Gini to kernel K-means clustering

Kernel K-means and Gini are two seemingly unrelated data clustering criteria that, to the best of our knowledge, have not been connected before. The former is commonly used as a graph clustering criterion, whereas the latter is widely used as a splitting criterion when building binary decision trees for classification. We demonstrate that, under relatively mild conditions, the kernel K-means objective reduces to the Gini criterion. Furthermore, we formally prove that under these conditions kernel K-means yields poor clusterings, with small dense subsets of points separated from the rest. This bias of kernel K-means and related pairwise clustering methods had been observed experimentally in the literature, but not explained theoretically. We observe that a similar bias was proven for the discrete histogram-based Gini criterion by L. Breiman (1996). We extend this result to the continuous Gini criterion and the corresponding kernel K-means objective. Our theoretical analysis provides insights for adaptive kernel bandwidth selection that avoids the bias; in particular, standard KNN emerges as a density-equalizing strategy via our isometric embedding argument. We show image segmentation and clustering experiments illustrating Breiman's bias and the positive effect of adaptive bandwidth strategies.
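
For reference, the two criteria in their common textbook forms (the talk's exact normalizations may differ):

```latex
% Kernel K-means (average association form) over clusters S_1..S_K
% with kernel k(.,.), and the Gini impurity of a split:
\begin{align*}
F_{\mathrm{kKM}}(S) &= \sum_{k=1}^{K} \frac{1}{|S_k|}
    \sum_{p, q \,\in\, S_k} k(p, q)
    \qquad \text{(to be maximized)} \\
G(S) &= \sum_{k=1}^{K} \frac{|S_k|}{N}
    \Bigl( 1 - \sum_{c} \hat{p}_{k}(c)^2 \Bigr)
    \qquad \text{(to be minimized)},
\end{align*}
% where \hat{p}_k(c) is the empirical class distribution inside S_k.
```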

James Hughes

Searching for Nonlinear Relationships in fMRI Data with Symbolic Regression

The vast majority of methods employed in the analysis of functional Magnetic Resonance Imaging (fMRI) data produce exclusively linear models; however, it is clear that linear models cannot fully describe a system with the observed behavioral complexity of the human brain --- an intrinsically nonlinear system. By using tools embracing the possibility of modeling the underlying nonlinear system, we may uncover meaningful undiscovered relationships which further our understanding of the brain.
We employ genetic programming, an artificial intelligence technique, to perform symbolic regression for the discovery of nonlinear models better suited to capturing the complexities of a high-dimensional dynamic system: the human brain.
fMRI data for multiple subjects performing different tasks were segmented into regions of interest, and nonlinear models were generated that described the system effectively and succinctly. The nonlinear models contained previously undiscovered relationships and selected different sets of regions of interest than traditional tools, leading to a more accurate understanding of the functional networks.
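
For a flavour of searching expression space (the talk uses genetic programming, which evolves a population with crossover and mutation; the toy random search below only illustrates the idea on invented data):

```python
# Toy symbolic-regression sketch: random search over small expression
# trees for a nonlinear model fitting synthetic data. Real genetic
# programming evolves a population instead of sampling independently.
import math, random

OPS = {"+": lambda a, b: a + b,
       "*": lambda a, b: a * b,
       "sin": lambda a: math.sin(a)}

def random_expr(depth=2):
    if depth == 0 or random.random() < 0.3:
        return random.choice(["x", round(random.uniform(-2, 2), 2)])
    op = random.choice(list(OPS))
    arity = 1 if op == "sin" else 2
    return [op] + [random_expr(depth - 1) for _ in range(arity)]

def evaluate(e, x):
    if e == "x":
        return x
    if isinstance(e, (int, float)):
        return e
    return OPS[e[0]](*(evaluate(sub, x) for sub in e[1:]))

def error(e, data):
    return sum((evaluate(e, x) - y) ** 2 for x, y in data)

random.seed(0)
data = [(x / 10, math.sin(x / 10) * 2) for x in range(-30, 31)]
best = random_expr()
for _ in range(5000):                 # keep the best of many samples
    cand = random_expr(depth=3)
    if error(cand, data) < error(best, data):
        best = cand
print(best, error(best, data))
```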

Xiang Li

Triply Stochastic Gradients on Multiple Kernel Learning

Multiple Kernel Learning (MKL) is highly useful for learning complex data with multiple cues or representations, but it is known to scale poorly because of expensive kernel computations. Dai et al. proposed a doubly stochastic gradient descent algorithm (doubly SGD) that greatly improves the scalability of kernel methods; however, it is not suitable for MKL because it cannot learn the kernel weights. In this paper, we provide a novel extension of doubly SGD to MKL so that the decision functions and the kernel weights are learned simultaneously. To achieve this, we develop the triply stochastic gradient descent (triply SGD) algorithm, which involves three sources of randomness -- the data points, the random features, and the kernels -- the last of which was not considered in previous work. We prove that our algorithm enjoys a convergence rate similar to that of doubly SGD. Compared to several traditional MKL solutions, our method converges faster and achieves better accuracy. Most importantly, it makes it possible to learn MKL problems with millions of data points on a normal desktop PC.
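
A heavily simplified sketch of the three sampling sources (the paper's actual updates, step sizes and feature resampling differ; everything below is illustrative only):

```python
# Heavily simplified sketch of the three randomness sources: sample a
# data point, sample a kernel, and use random Fourier features of that
# kernel, then update function weights and kernel weights.
# (Here features are pre-sampled once; doubly/triply SGD draw fresh
# random features each step.)
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=200))

bandwidths = [0.5, 1.0, 2.0]       # one RBF kernel per bandwidth
D = 50                             # random features per kernel
W = [rng.normal(scale=1.0 / s, size=(5, D)) for s in bandwidths]
theta = np.zeros((len(bandwidths), 2 * D))       # per-kernel weights
mu = np.ones(len(bandwidths)) / len(bandwidths)  # kernel weights

def features(k, x):
    z = W[k].T @ x
    return np.concatenate([np.cos(z), np.sin(z)]) / np.sqrt(D)

for t in range(1, 2001):
    i = rng.integers(len(X))                 # randomness 1: data point
    k = rng.integers(len(bandwidths))        # randomness 2: kernel
    phi = features(k, X[i])                  # randomness 3: random W[k]
    pred = sum(mu[j] * theta[j] @ features(j, X[i]) for j in range(len(mu)))
    grad = pred - y[i]                       # squared-loss gradient
    theta[k] -= (1.0 / t) * grad * mu[k] * phi
    mu[k] -= (0.1 / t) * grad * (theta[k] @ phi)
    mu = np.clip(mu, 0, None); mu /= mu.sum()    # stay on the simplex

acc = np.mean(np.sign([sum(mu[j] * theta[j] @ features(j, x)
                           for j in range(len(mu))) for x in X]) == y)
print("train accuracy:", acc)
```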

2B Computer Vision & Image Processing
Egor Chesakov

Vascular tree estimation from micro CT images using GPU-accelerated centerline regularization

Many standard methods for vascular tree estimation require segmentation of the vessels. The simplest methods (e.g. thresholding or region growing) work only on relatively thick vessels and fail on thinner ones due to partial-volume effects and noise. It is technically possible to apply standard first-order regularization methods (e.g. graph cuts, level sets), but such methods implicitly enforce a compact shape prior (spheres or blobs), manifesting itself as the well-known “shrinking bias”. This makes such segmentation methods fundamentally inappropriate for thin structures. These structures require second-order surface regularization based on Gaussian curvature, but no robust algorithms are known for this general geometric model. For example, second-order deformable models need good initialization, requiring a priori knowledge of the vessel topology. Interactive methods with user clicks on individual vessels are practical only for small trees.
We propose a method that directly estimates vessel centerlines by regularizing their curvature, using the energy model of Marin et al. [1]. We thereby bypass the complex vessel segmentation problem. Our main insight is that the curvature model for 1D contours/centerlines is much simpler than (e.g. Gaussian) curvature for a surface. While our approach is related to medial axis estimation methods, those require a segmentation of the vessel boundary as input. GPU acceleration is essential in [1] and reduces the total running time from one day to a few minutes [2].
References: [1] Dmitrii Marin, Yuchen Zhong, Maria Drangova, Yuri Boykov, “Thin Structure Estimation with Curvature Regularization”, International Conference on Computer Vision (ICCV), Santiago, Chile, December 2015. [2] Egor Chesakov, “Vascular Tree Structure: Fast Curvature Regularization and Validation”, MSc thesis, Department of Computer Science, UWO, London, Canada.
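
As a minimal illustration of the quantity being regularized (the energy model in [1] is more sophisticated), discrete curvature of a centerline measured by turning angles:

```python
# Minimal sketch: discrete curvature of a 1D centerline, measured as
# turning angles between consecutive segments. (The energy model in
# [1] is more sophisticated.)
import numpy as np

def curvature_energy(points):
    """Sum of squared turning angles along a polyline of 3D points."""
    p = np.asarray(points, dtype=float)
    v = np.diff(p, axis=0)                        # segment vectors
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    cos_a = np.clip(np.sum(v[:-1] * v[1:], axis=1), -1.0, 1.0)
    return np.sum(np.arccos(cos_a) ** 2)

straight = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
bent     = [(0, 0, 0), (1, 0, 0), (1, 1, 0)]
print(curvature_energy(straight), curvature_energy(bent))  # 0.0 vs (pi/2)^2
```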

Seereen Noorwali

Hierarchical Scene and Range Flow Estimation

Optical flow computation is one of the oldest and still most active research fields in computer vision and image processing. Optical flow methods calculate the motion between two image frames: in 2D images, how much each pixel moves between adjacent frames; in the 3D case, how much each voxel moves between adjacent volumes in the dataset. Since 1980, several algorithms have successfully estimated 2D and 3D optical flow. Scene flow and range flow can be considered special cases of 3D optical flow. Scene flow is the 3D optical flow of voxels on a moving surface; it uses disparity and disparity-gradient maps computed from a stereo sequence, together with the 2D optical flow of the left and right images, to compute 3D motion. Range flow is identical to scene flow but is calculated from sequences of depth maps or range data. The algorithms to compute scene flow and range flow therefore overlap, and we propose new insights that can help range flow algorithms advance to the next stage: large displacements can be handled with a hierarchical framework (sketched below), and segmentation algorithms can help recover the movement of each object separately in the range data. These scene and range flow algorithms can be tested on several existing stereo sequences and on range data acquired from existing synthetic and real car-driving sequences, or with a Kinect or ShapeGrabber in the lab to construct our own sequences.
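
A generic coarse-to-fine skeleton of the kind referred to above (the per-level solver is left as a placeholder; this is a sketch, not the proposed algorithm):

```python
# Generic coarse-to-fine skeleton for flow estimation: estimate flow
# at the coarsest pyramid level, then upsample and refine level by
# level so large displacements become small at coarse scales.
import numpy as np

def downsample(img):
    return img[::2, ::2]                 # crude 2x decimation

def refine(f1, f2, init_flow):
    return init_flow                     # placeholder for one flow solve

def hierarchical_flow(f1, f2, levels=3):
    pyr1, pyr2 = [f1], [f2]
    for _ in range(levels - 1):
        pyr1.append(downsample(pyr1[-1]))
        pyr2.append(downsample(pyr2[-1]))
    flow = np.zeros(pyr1[-1].shape + (2,))
    for a, b in zip(reversed(pyr1), reversed(pyr2)):
        if flow.shape[:2] != a.shape:    # upsample flow to next level
            flow = 2 * np.kron(flow, np.ones((2, 2, 1)))[:a.shape[0], :a.shape[1]]
        flow = refine(a, b, flow)
    return flow

print(hierarchical_flow(np.zeros((16, 16)), np.zeros((16, 16))).shape)
```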

Md. Junaedur Rahman

Modeling driver behaviour for Advanced Driving Assistance Systems

Traffic collisions remain a continuous threat to human life. A collision occurs when a car hits another vehicle, a human, or a stationary object, and it can result in injury, death, vehicle damage, and property damage. Survivors often suffer psychological effects such as trauma in the post-collision phase. Because the issue endangers human life, it demands solid research to identify causes and provide greater safety. Advanced Driving Assistance Systems (ADAS) are developed to offer better safety and security in the driving environment, providing technology that improves driver accuracy and helps avoid collisions by implementing safeguards.

3B Computer Algebra
Davood Mohajerani

FFTs over Prime Fields of Large Characteristic and their Implementation on GPUs

Prime field arithmetic plays a central role in computer algebra and supports computation in Galois fields, which are essential to coding theory and cryptography.
The prime fields used in computer algebra systems, in particular in the implementation of modular methods, are often of small characteristic, that is, based on prime numbers that fit in a machine word. Precision beyond the machine word size can be obtained via the Chinese Remainder Theorem or Hensel's Lemma.
In this work, we consider prime fields of large characteristic, typically fitting on n machine words, where n is a power of 2. When the characteristic of these fields is restricted to a subclass of the generalized Fermat numbers, we show that arithmetic operations in such fields offer attractive performance both in terms of algebraic complexity and parallelism.
In particular, these operations can be vectorized, leading to efficient implementation of fast Fourier transforms on graphics processing units.
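
For illustration, a textbook radix-2 number-theoretic transform over a small prime field; the talk targets much larger generalized Fermat primes with GPU-friendly arithmetic:

```python
# Sketch of a radix-2 number-theoretic transform (FFT over a prime
# field Z/pZ). p = 257 is a (small) Fermat prime and 3 a primitive
# root mod 257; the talk concerns much larger generalized Fermat
# primes spanning several machine words.
def ntt(a, p=257, g=3, invert=False):
    n = len(a)                       # n must be a power of 2 dividing p-1
    if n == 1:
        return a[:]
    w_n = pow(g, (p - 1) // n, p)    # principal n-th root of unity
    if invert:
        w_n = pow(w_n, p - 2, p)     # inverse root via Fermat's little thm
    even = ntt(a[0::2], p, g, invert)
    odd = ntt(a[1::2], p, g, invert)
    out, w = [0] * n, 1
    for i in range(n // 2):
        t = w * odd[i] % p
        out[i] = (even[i] + t) % p
        out[i + n // 2] = (even[i] - t) % p
        w = w * w_n % p
    return out

a = [1, 2, 3, 4, 0, 0, 0, 0]
A = ntt(a)
inv_n = pow(len(a), 257 - 2, 257)
back = [x * inv_n % 257 for x in ntt(A, invert=True)]
print(back)  # recovers [1, 2, 3, 4, 0, 0, 0, 0]
```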

Masoud Ataei

On the Extended Hensel Construction

The Extended Hensel Construction (EHC) is an algorithm for factorizing univariate polynomials with power series coefficients, as well as multivariate polynomials. It was proposed in 1999 by Tateaki Sasaki and Fujio Kako, whose goal was to provide a practically more efficient alternative to the classical Newton-Puiseux method for bivariate polynomial factorization.
In this talk I will explain the algorithm and our contributions to improving it. We have also implemented the improved EHC in the PowerSeries library for Maple, available at www.regularchains.org.

Parisa Alvandi

Computing limits of real multivariate rational functions

Computing limits of functions is a basic task in multivariate calculus, and many fundamental concepts in mathematics are defined in terms of such limits. The case of univariate functions is well studied, but, perhaps surprisingly, the case of multivariate rational functions is still an active research area. We present an algorithm for determining whether the limit of a real multivariate rational function q exists at a given point p. When the limit exists, the algorithm computes it, without any assumption on the number of variables. For this algorithm, it is proved that taking the limit of q along finitely many special paths lying on the surface defined by q = 0 and passing through the point p is enough to determine whether or not the limit of q exists at p.
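
A classical calculus example (not from the talk) of why finitely many straight lines through p do not suffice, motivating carefully chosen paths:

```latex
% Classical example: for f(x,y) = x^2 y / (x^4 + y^2), every straight
% line through the origin gives limit 0, yet the limit does not exist.
\[
\lim_{t \to 0} f(t, mt)
  = \lim_{t \to 0} \frac{m\, t^3}{t^4 + m^2 t^2} = 0
  \quad \text{for all } m,
\qquad\text{while}\qquad
f(x, x^2) = \frac{x^4}{2 x^4} = \frac{1}{2}.
\]
```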

Ruijuan Jing

Computing the integer points of a polyhedron

For a polyhedral set K given by a system of linear inequalities with rational coefficients, we propose an algorithm for computing a representation of the integer points of K. This representation is in terms of "simpler" polyhedral sets: to use the terminology introduced by William Pugh for the Omega test, each such simpler polyhedron has an empty grey shadow. Moreover, we prove that, under mild assumptions, the number of bit operations performed by this algorithm is singly exponential in the number of facets of K and the height of the input coefficients.
Like the Omega test, our work is motivated by problems arising in the analysis of computer programs, in particular in support of the automatic generation of code targeting graphics processing units.

Event Time:
April 10, 2017 8:30am-5:00pm
Location:
Middlesex College, University of Western Ontario
Registration:
Closed Now