Program: |
Tentative program
November 10th (Thursday), 2016
(Venue: Room D107, Building D, U. of Skövde)
09:30 - 10:00 Opening
- Jörgen Hansson, Head of the School of Informatics
10:00 - 10:50 Invited talk:
-
Ola Gustafsson (Dagens Nyheter)
Chair: Alan Said
News recommendation with diversity and reduced gender bias at Dagens Nyheter
Online news recommendation is a field of many practical challenges but also one of conflicting goals. As we optimize personalized recommendations for pageviews and click-through rates, we may also reinforce bias and the creation of filter bubbles. At Dagens Nyheter, we design algorithms for personalization in a way that aims to align with editorial ambitions of diversity and reduced gender bias. Maybe it is time to talk about consumer awareness, as an aspect of algorithmic design?
11:00 - 12:00 Session Chair: Maria Riveiro
- A smoothed monotonic regression via L2 regularization
Oleg Sysoev, Oleg Burdakov (LIU)
Monotonic Regression (MR) is a standard method for extracting a monotone function from non-monotonic data, and it is used in many applications. However, a known drawback of this method is
that its fitted response is a piecewise constant function, while practical response functions are often required to be continuous. We propose a method that achieves monotonicity and
smoothness of the regression by introducing an L2 regularization term, and it is shown that the worst-case complexity of this method is O(n^2). In addition, our simulations demonstrate that
the proposed method is very fast, i.e. it is able to fit more than a million observations in less than one minute, and it has a higher predictive power than some commonly used alternative
methods, such as monotonic kernel smoothers. In contrast to these methods, our approach is probabilistically motivated and has connections to Bayesian modeling.
[1] O. Sysoev and O. Burdakov. A smoothed monotonic regression via l2 regularization. Technical Report LiTH-MAT-R–2016/01–SE, Department of Mathematics, Linköping University, 2016.
[urn.kb.se]
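As an illustration of the piecewise constant behaviour mentioned in the abstract above, here is a minimal sketch of classical (unsmoothed) monotonic regression using the pool-adjacent-violators idea. It is not the authors' smoothed method; the function name `pav` and the sample data are assumptions made for this example.

```python
def pav(y, w=None):
    """Least-squares non-decreasing fit (classical pool-adjacent-violators).

    Illustrative sketch only: this is plain monotonic regression,
    not the L2-regularized smoothed variant from the talk.
    """
    if w is None:
        w = [1.0] * len(y)
    # Each block stores [weighted mean, total weight, number of points].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Merge backwards while neighbouring blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, c1 + c2])
    # Expand blocks back to one fitted value per observation.
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit


fitted = pav([1.0, 3.0, 2.0, 4.0, 3.5, 5.0])
print(fitted)  # [1.0, 2.5, 2.5, 3.75, 3.75, 5.0]
```

The repeated values in the output are the constant pieces: each violating pair is pooled into a single level, which is the drawback that a smoothing term is meant to address.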
- Approximate Search in Large Intrusion Detection and SPAM Filtering Data Sets
Ambika Shreshta Chitrakar, Slobodan Petrovic (NTNU, Gjøvik, Norway)
Due to the enormous amount of data traffic and high data rates, the quantities of data to be analyzed by Intrusion Detection Systems (IDS) and SPAM filters per unit of time have become the
limiting factor of their further development. The efficiency of current search algorithms used in these systems is not high enough for real time data processing that is necessary in order for
the attacks against computer systems to be detected in real time. There are two possibilities for further development of these systems: improving the efficiency of the search algorithms and
reducing the data sets to be analyzed at a time. Regarding the efficiency of the search algorithms, there are limitations to using the theoretically best possible algorithms (so-called skip
algorithms, such as Backward Non-deterministic DAWG Matching) since these algorithms are sensitive to algorithmic attacks against the very IDS. Namely, the average-case time complexity of
these search algorithms is much better than the worst-case complexity and consequently an attacker can deliberately send attack traffic that makes these algorithms perform poorly. On the
other hand, regarding the size of the data sets to be processed at a time, there is potential for reduction since many attack signatures have common structure due to the fact that the new
attacks often originate from the old ones. This drives application of approximate search in intrusion detection, which is capable of detecting many similar attacks with only one execution of
the search algorithm. Thus, by using approximate search in intrusion detection, we obtain a more efficient search algorithm operating over a reduced dataset. This has the potential to significantly
improve the efficiency of IDS. In SPAM filtering, the spammers want to avoid detection by deliberately changing the SPAM words. Approximate search is also capable of detecting such
cases. To avoid so-called false positives and false negatives in both intrusion detection and SPAM filtering, we introduce constraints in the approximate search algorithms that limit the
total numbers of edit operations and/or the lengths of runs of edit operations. The constraints exploit the fact that the attackers/spammers cannot apply just any number of edit operations on
the traffic they generate and that the distribution of these changes cannot be arbitrary. Otherwise, the attacks might behave in an unpredictable way and the SPAM messages would lose their
intelligibility. This talk explains how these constraints are used and what their effect is on the numbers of false positives/negatives and the efficiency of the search algorithms.
[1] A. S. Chitrakar, S. Petrović, Approximate search with constraints on indels with application in SPAM filtering, Proc. Norwegian Information Security Conference (NISK-2015), pp. 22-33.
[2] A. S. Chitrakar, S. Petrović, Constrained row-based bit-parallel search in intrusion detection, submitted to NISK 2016.
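To make the idea of approximate search with a bound on the total number of edit operations concrete, here is a minimal sketch based on the textbook dynamic-programming formulation of approximate substring matching. It is not the bit-parallel algorithm from the references above, and it does not model the constraint on run lengths; the function name `approx_find` and the sample strings are assumptions for illustration.

```python
def approx_find(pattern, text, max_edits):
    """Return end positions in `text` where `pattern` matches
    with at most `max_edits` insertions, deletions or substitutions."""
    m = len(pattern)
    # col[i] = fewest edits needed to match pattern[:i] against some
    # substring of the text ending at the current position.
    col = list(range(m + 1))
    hits = []
    for j, ch in enumerate(text):
        prev_diag = col[0]  # col[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new = min(prev_diag + cost,  # match or substitution
                      col[i] + 1,        # extra character in the text
                      col[i - 1] + 1)    # pattern character missing
            prev_diag = col[i]
            col[i] = new
        if col[m] <= max_edits:
            hits.append(j)
    return hits


message = "win the l0ttery today"
print(approx_find("lottery", message, 1))  # [14]
print(approx_find("lottery", message, 0))  # []
```

With a budget of one edit, the obfuscated word is still found, while an exact search (budget zero) misses it. Tightening or loosening `max_edits` is the simplest way to trade false negatives against false positives in this sketch.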
- Root-Cause Localization using Restricted Boltzmann Machines
H. Joe Steinhauer, Alexander Karlsson, Gunnar Mathiason, Tove Helldin (HIS)
Monitoring complex systems and identifying degrading system components before the system or parts thereof fail is crucial for many application areas, among them today’s and future
telecommunication systems. With the increasing complexity of such systems, the need to aid human operators through the use of machine learning tools is growing. In this paper, we present an
automated approach for root-cause localization, a first step towards root-cause analysis, using a Restricted Boltzmann Machine (RBM). We describe an experiment conducted on data with ground
truth, stemming from a simplified network. We use the RBM to cluster symptoms of degradation and we show how the results produced by the RBM capture the location of different possible
combinations of hidden root causes.
[1] H. J. Steinhauer, A. Karlsson, G. Mathiason, T. Helldin, Root-Cause Localization using Restricted Boltzmann Machines, Proc. 19th Int. Conf. Information Fusion.
12:00 - 13:30 Lunch time
13:30 - 14:20 Invited talk:
-
Svetoslav Marinov (Seal Software)
Chair: Joe Steinhauer
Machine learning for contract analytics
At Seal Software we apply machine learning techniques extensively to the analysis of legal contracts, using both supervised and unsupervised learning. With the latest release of our product we give users the possibility to create their own models based on their own manually annotated data. This feature came with a lot of interesting problems, such as which evaluation techniques are optimal and how we should handle imbalanced learning (i.e. where the underlying training data is quite skewed). In this talk, I will walk you through the way we are tackling these and some other problems in a user-driven machine learning environment.
14:30 - 15:30 Session Chair: Tove Helldin
- Efficient Parameter Tuning for Image Binarization
Florian Westphal, Håkan Grahn, Niklas Lavesson (BTH)
Image binarization is the first important processing step when making
historical document images searchable, transcribing them or analyzing the
layout of these documents. A good binarization quality is paramount for
those tasks, since only the detected foreground pixels will be processed
further. With ever growing collections of digitized documents, efficient
processing becomes equally vital. In our work, we propose a fast way for
tuning the parameters of a state-of-the-art binarization algorithm. These
parameters adjust the binarization algorithm to a given image to improve
binarization quality. By predicting the algorithm's parameters based on
image features, such as contrast, homogeneity, edge mean intensity and
background standard deviation, we are able to tune the algorithm's
parameters on average 3 times faster than previous approaches. This is an
average time difference of 17 seconds on a standard binarization dataset,
which is achieved without decreasing the binarization quality.
- FlinkML: Large Scale Machine Learning with Apache Flink
Theodore Vasiloudis (SICS & KTH)
Apache Flink is an open source platform for distributed stream and batch data processing. In this talk we will show how Flink's streaming engine and support for native iterations make it an excellent candidate for the development of large scale machine learning (ML) algorithms.
This talk will focus on FlinkML, an effort to develop scalable machine learning tools utilizing the efficient distributed runtime of Apache Flink. We will provide an introduction to the library, illustrate how we employ state-of-the-art algorithms for classification, regression and recommendation to make FlinkML truly scalable, and provide a view into the challenges and decisions one has to make when designing a robust and scalable machine learning library. Particular attention will be given to the challenges encountered in developing a community-driven open-source ML library.
Finally, if time permits, we will demonstrate how one can perform interactive analysis using FlinkML and the notebook environment of Apache Zeppelin, combining the power of a distributed processing engine with more traditional data science tools like the Python scipy stack.
- Reconstruction of Causal Networks to Resolve Conflicting Causal Inferences
Sepideh Pashami(1), Anders Holst(2), Sławomir Nowaczyk(1) ((1) Halmstad University, (2) SICS)
One of the challenges for maintenance of vehicles is finding the root cause of the fault in order to avoid reoccurrence of the failure and to further avoid undesirable follow-up failures.
Besides, large savings can be obtained by focusing on components where failures are likely to cause costly collateral damage; for example, over-speeding and destruction of a turbocharger often
lead to additional engine damage. A causal network is useful for providing the overall picture of how various parameters are affecting the vehicle’s performance.
With the increased amount of diverse data being collected, there are new opportunities to distinguish between correlation and causation, either automatically or semi-automatically. The focus
of this work is to identify the causal relations between signals measured on-board heavy duty vehicles. A proposed method reconstructs the causal network using the PC algorithm [1] in order
to get closer to underlying causal structures between signals. The PC algorithm builds a Markov equivalence class which contains the underlying causal graph, and represents it by a Completed
Partially Directed Acyclic Graph (CPDAG). The calculations are based on some form of conditional independence test. In many cases, the CPDAG produced contains bi-directed edges. Such
bi-directed edges are undesirable, as they imply the existence of a confounding variable, i.e. an unknown factor which influences both of the signals (nodes). Similarly, fully connected nodes
(cliques) within the causal network are undesirable due to the ambiguity of identifying cause and effect. We relax the sufficiency assumption by adding one or more latent variables. We connect
the latent variables to the nodes in the cliques. Further, we assign values to these latent variables in a way that nodes in each clique become independent given the corresponding latent
variable. After that, the PC algorithm is rerun with the new set of variables until no more conflicting variables remain.
The effectiveness of the proposed approach is demonstrated on a data set collected by a fleet of five city buses. In particular, the aim is to identify the causal relation between the set of
signals influencing fuel consumption. This analysis is performed only based on observational data, without the need for specifying the underlying physical model.
[1] P. Spirtes, C. Glymour, and R. Scheines, Causation, Prediction, and Search, 2nd ed. Cambridge, MA: MIT Press, 2000.
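The abstract above does not say which conditional independence test is used. As one common choice for roughly linear signals, the sketch below uses first-order partial correlation to show how two signals that are correlated through a shared cause become uncorrelated once that cause is conditioned on. The function names and the toy data are assumptions for illustration and are not taken from the authors' work.

```python
from math import sqrt


def corr(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = corr(x, y), corr(x, z), corr(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Toy data: z is a shared cause, e1 and e2 are independent disturbances.
z = [1, 1, -1, -1]
e1 = [1, -1, 1, -1]
e2 = [1, -1, -1, 1]
x = [2 * zi + ei for zi, ei in zip(z, e1)]
y = [2 * zi + ei for zi, ei in zip(z, e2)]

marginal = corr(x, y)                # strong correlation, about 0.8
conditional = partial_corr(x, y, z)  # essentially zero given z
print(round(marginal, 3), round(abs(conditional), 3))
```

In a constraint-based procedure such as the PC algorithm, a vanishing conditional dependence like this is what justifies removing the edge between `x` and `y`; a real test would also need a significance threshold, which this sketch leaves out.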
15:30 - 16:00 Coffee
16:00 - 17:00 Session Chair: Göran Falkman
- Energy Efficiency in Machine Learning
Eva García Martín, Niklas Lavesson, Håkan Grahn (BTH)
Energy efficiency has become a key concern during the past years in software development. Researchers in machine learning are starting to understand the importance of designing solutions from
a sustainable perspective. For instance, deep learning techniques and algorithms, known for their high performance solutions, are being optimised towards energy efficiency. Google is also
applying machine learning techniques to reduce the energy consumption of their servers.
However, we believe that there needs to be a systematic approach to add the energy efficiency variable to algorithm analysis, since algorithms are still developed considering the traditional
variables.
The goal of this study is to present a reproducible approach to analyze and optimize machine learning algorithms from an energy efficiency perspective. The energy consumption is examined
together with the accuracy, to portray the different trade-offs that exist when trying to reduce the energy consumption of a computation. We created an experiment where we measure the energy
consumption and accuracy of different data stream mining algorithms and algorithmic setups.
The results show that energy consumption can be reduced by 74.29% in the Very Fast Decision Tree algorithm by sacrificing just 1% of accuracy.
The main contribution of this work is the validation that energy is an important factor to take into consideration when designing algorithms. Since this factor is often overlooked, we believe
that different algorithms could be chosen for a specific task from a green computing perspective depending on the accuracy constraints.
This work also enables us to optimise algorithms for computer platforms with scarce resources, such as embedded systems.
[1] Work in progress. Submitted also to WiML (Women in Machine Learning) 2016
- Random Forest Response Surfaces for Robust Design
Siva Krishna Dasari (1), Niklas Lavesson (1), Johan Wall (1), Petter Andersson (2) ((1) BTH, (2) GKN Aerospace Engine Systems Sweden)
Intelligent data analysis is increasingly used within the area of product development for different types of decision support. One example of this is the construction of response surface
models (RSMs) in a model-based approach to robust design. The construction of RSMs requires a dataset of inputs, and known outputs from simulated experiments. Since simulations are expensive
to conduct, datasets are usually small in a real-world context. The size of the data sets and the complexity of the underlying simulation model make it difficult to generate accurate and
robust RSMs. For robust design, aiming to reduce the variation in system performance, sensitivity analysis (SA) based on RSMs allows for efficient studies of how uncertainties in input
parameters affect system performance. In this study, we investigate the applicability of Random Forests (RF) for RSMs and subsequent SA. The reasons for selecting RF are that: (1) it can
handle non-linear data, (2) it can handle high-dimensional data, (3) it gives the importance of parameters by ranking them, (4) ensemble methods generally build more accurate models than
single models, and (5) if the design engineers need information about variable interactions, it is possible to extract human understandable decision rules from tree models. To determine
whether RF can perform as well as other methods, we compare RF to Multivariate Adaptive Regression Splines (MARS) and Support Vector Machines (SVM). We conducted two experiments using
anonymized real world and synthetic data for RSMs and SA respectively. The output from the RSMs suggests that the three studied algorithms perform equally well with respect to predictive
accuracy. The output from the SA suggests that RF and MARS perform equally well and better than SVM on non-linear response. Furthermore, it was shown that RF is more computationally efficient
than MARS and SVM. Experimental results combined with other potential benefits of RF related to robust design, such as the ability to screen parameters, indicate that RF is suitable
for the intended application.
[1] Siva Krishna Dasari, Niklas Lavesson, Johan Wall and Petter Andersson. Random Forest Response Surfaces for Robust Design. Submitted to IEEE Int. Conf. on Machine Learning and Applications (IEEE ICMLA'16), 2016.
- Data privacy and data provenance for big data
Vicenç Torra (HIS)
In this work we will discuss the problem of data privacy for big data. We will focus on the relationship between data provenance and data privacy, and show that data provenance can be used to implement the right to amend and the right to be forgotten. We will introduce a definition of privacy for the case in which modifications in a database are sensitive information.
[1] V. Torra, G. Navarro-Arribas, Integral Privacy, Proc. CANS 2016, LNCS 10052, 661-669.
November 11th (Friday), 2016
(Venue: Room Insikten, Building Portalen, U. of Skövde)
08:30 - 09:00 Coffee
09:00 - 09:50 Invited talk:
-
Alexander Schliep (Gothenburg University)
Chair: Alexander Karlsson
Compressive Omics: Data science for biomedical applications
The biomedical field has seen an enormous change due to the rapid advances in throughput and cost of experimental instruments, automatization as well as expansion of experiment types and modalities. For example, High-Throughput Sequencing (HTS), a technology to unravel genomic sequences on a large scale, is pervasive in clinical and biological applications such as cancer research and basic science, and is expected to gain enormous momentum in future precision medicine applications. As a consequence, the storage, processing and transmission of HTS data pose great challenges for method developers and practitioners.
Compressive Omics, the use of compressed and reduced representations of biological data, was identified by the NIH as one of the core techniques for developing methods which can keep up with the increases in data. In contrast to typical uses of compression to reduce storage requirements, the focus is on representations suitable for computational analysis.
We will present our work on using compressed and reduced representations for accelerating advanced statistical computations at genome-scale. This includes recent results on fully Bayesian Hidden Markov Models for identifying Copy Number Variants (segmentation of observation sequences) and compressive genomics.
10:00 - 10:30 Coffee
10:30 - 11:50 Session Chair: Juhee Bae
- Short-term highway traffic prediction using dynamic parameter combinations in k-nearest neighbours with comprehensive understanding of its parameters
Bin Sun, Wei Cheng, Prashant Goswami, Guohua Bai (BTH)
Increasing road traffic is nowadays causing more congestion and accidents, which are gaining more attention from the public and authorities due to severe loss of life and property.
To achieve efficient traffic management and accident detection, reliable and accurate short-term traffic forecasting is necessary.
The k-nearest neighbours (KNN) method is widely used for short-term traffic forecasting, but choosing the right parameter values for KNN is problematic due to dynamic traffic
characteristics.
In this work, we comprehensively analyse the relationship among all three KNN parameters, which are the number of nearest neighbours, the search step length/lag, and the window size/shift constraint.
We observed that optimizing each parameter individually does not lead to the best parameter values. Thus, optimizing all three parameters simultaneously is necessary.
We propose a dynamic procedure that can use suitable parameter combinations of KNN to predict traffic flow metrics. The proposed procedure adjusts combinations dynamically according to
traffic flow situations. The results show that KNN with the dynamic procedure performs better than the benchmark methods.
[1] Submitted to journal: IET Intelligent Transport Systems, June 2016.
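For readers unfamiliar with KNN forecasting, the following minimal sketch shows how two of the parameters named in the abstract above, the number of neighbours `k` and the lag, enter a one-step prediction. The window/shift constraint and the dynamic selection procedure are not modelled; the function name `knn_forecast`, the Euclidean distance and the toy flow series are assumptions for illustration.

```python
def knn_forecast(series, k, lag):
    """Predict the next value as the mean successor of the k historical
    windows most similar to the last `lag` observations."""
    query = series[-lag:]
    candidates = []
    # A candidate window covers series[s:s + lag] and is followed by
    # series[s + lag]; the query window itself is excluded.
    for s in range(len(series) - lag):
        window = series[s:s + lag]
        dist = sum((a - b) ** 2 for a, b in zip(window, query)) ** 0.5
        candidates.append((dist, series[s + lag]))
    candidates.sort(key=lambda c: c[0])
    nearest = candidates[:k]
    return sum(nxt for _, nxt in nearest) / len(nearest)


# Toy periodic flow counts (vehicles per interval), purely illustrative.
flow = [10, 20, 30, 10, 20, 30, 10, 20]
prediction = knn_forecast(flow, k=2, lag=2)
print(prediction)  # 30.0
```

Changing `k` or `lag` alters which historical windows count as neighbours, which is why the two cannot be tuned in isolation: a lag that is too short makes many unrelated windows look identical, and a large `k` then averages over them.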
- Market Share Prediction Based on Scenario Analysis Using a Naive Bayes Model
Shahrooz Abghari, Niklas Lavesson, and Håkan Grahn (BTH)
In today’s competitive marketing environments, such as telecom, companies fight to gain a larger market share. The importance of the market share is due to its direct relationship with the
profitability and sustainability of a company. The higher the share, the greater a company’s chances to achieve high revenues that can consolidate the company’s position in the market.
Designing a successful marketing strategy is a complicated process that always requires a careful and realistic assessment of a company's market position. Moreover, it requires an estimate of
the future market size growth based on supply and demand scenarios together with an estimate of the future market shares based on competitors’ objectives and strategies.
In order to handle this complicated process, companies often aggregate different sources of information to analyze different scenarios and predict the market shares. For instance, the
market share can be predicted by considering the effect of introducing a new product/service to the current market and/or a new market, or the effect of increasing, decreasing, or maintaining
the company’s share in one specific region.
This work presents an ongoing case study from a telecom company and aims at predicting the market share by applying different scenarios. These scenarios range from introducing a new
service/product and targeting a new market to analyzing the effect of economic crises on the market share. To achieve this, a decision model based on a naive Bayes model is proposed. The naive
Bayes model is a useful tool for reasoning under uncertainty and where expert knowledge is incomplete and/or ambiguous. The performance of the model is going to be evaluated with real
data and the validity of the results will be investigated by the experts at the telecom company.
- Learning Meaningful Knowledge Representations for Self-Monitoring Applications
Sławomir Nowaczyk, Sepideh Pashami, Mohamed-Rafik Bouguelia, Antanas Verikas (Halmstad University)
The ability to learn meaningful and useful data representations is crucial for machine learning applications that need to maintain good performance as they are being applied to more and more
general problems in more and more complex settings. In this work we present an awareness-based self-monitoring system that learns how to continuously extract useful information from streaming
data. In a setting where a group of individuals is available for observation and comparison, new challenges arise with regard to representation learning.
We focus on the possibility of creating a general meta-framework for the purpose of self-monitoring, more specifically, to detect an emerging anomaly for an individual system, one that can
lead to a failure in the near future. We are interested in finding an appropriate representation of the “normal” behavior or reference, both for an individual and for the whole group, based on
parameters such as configuration, task or external conditions. A suitable comparison method then needs to be developed to determine whether a given observation, or a set of observations over
time, is sufficiently similar to this reference to be considered normal.
We have identified several different aspects and challenges in that context. First, one needs to automatically select the most suitable representation paradigms to represent the available
data in a compact and expressive way, based on the properties of the data, the details of the task to be solved, and other constraints of the domain. The automatic determination of relevant
data representation requires designing models that allow for efficient learning, while being flexible enough to capture different aspects of the data simultaneously and take into account
different kinds of initial domain and expert knowledge.
Second, a common assumption for anomaly detection is that the majority of observations represent the correct behavior of the system. However, in many real-world applications, both complete
failures and the optimal functioning are quite rare, and the majority of the data corresponds to a mediocre operation.
Moreover, patterns of normal behavior become less and less distinct as the number of individuals within the reference group increases, due to the inevitably higher variability. Appropriate
groups for self-monitoring are based on domain-specific relevant criteria, and influenced by external factors such as weather, which can be subject to change over time. Distinguishing between
usual changes (e.g. seasonal changes) and unusual changes that are anomalous (e.g. indicating a deviation of an individual from a group) poses challenges for the representation learning.
Finally, the created models should not only be capable of providing end users with descriptive, explanatory analysis, and rich visualization functionality, but should also be capable of
taking into account the expert’s feedback in order to improve representations.
- Accelerating science with big data at Chalmers
Hans Salomonsson, Oscar Ivarsson, Azam Muhammad Sheikh, Pramod Bangalor (Chalmers)
The recent revolution in data science has had a profound influence on the research methods in many areas. In order to take full advantage of this revolution, researchers need to develop and adapt to these new research methodologies. But the complexity of the underlying methods, tools and infrastructure can be overwhelming for individuals within a single research group. To address this problem, Chalmers has created a pool of data science experts that work across research groups. This talk will give an overview of a few projects the team has completed to demonstrate the usefulness and variety of big data technologies.
11:50 - 12:00 Closing session
|
Topics:
|
What is data science and why is it relevant?
Data science focuses on the extraction of knowledge from data. The overall aim is to make better use of the ever increasing amount of data generated by individuals, societies, companies, and science. To achieve this aim, the objectives are to identify relevant challenges and problems, to study, develop and evaluate solutions based on efficiency and effectiveness, and to perform successful implementations. Data science is based on theory and methods from many fields, including: computer vision, data mining & knowledge discovery, machine learning, optimization, statistics, and visualization.
Data science is not about blindly sifting through data in the hope for interesting results and discoveries. In contrast, data science requires the ability to make sense of our complex world and the domain under study, and to use this understanding and the available data to develop suitable mathematical models that help us explain and predict interesting phenomena.
Topics of Interest include, but are not limited to, the following:
Methods and Algorithms
- Classification, Clustering, and Regression
- Probabilistic & Statistical Methods
- Graphical Models
- Spatial & Temporal Mining
- Data Stream Mining
- Feature Extraction, Selection and Dimension Reduction
- Data Cleaning, Transformation & Preprocessing
- Multi-Task, Multi-label, and Multi-output Learning
- Big Data, Scalable & High-Performance Computing Techniques
- Mining Semi-Structured or Unstructured Data
- Text & Web Mining
- Data privacy
Applications
- Image Analysis, Restoration, and Search
- Climate/Ecological/Environmental Science
- Risk Management and Customer Relationship Management
- Genomics & Bioinformatics
- Medicine, Drug Discovery, and Healthcare Management
- Automation & Process Control
- Logistics Management and Supply Chain Management
- Sensor Network Applications and Social Network Analysis
|