[ English | Japanese ]
Research Outline
We are working on various aspects of machine learning techniques
ranging from the development of fundamental theories
and practical algorithms to the application to real-world problems
(textbook in Japanese: Statistical Machine Learning),
(textbooks in Japanese: Pattern Recognition and Machine Learning).
Supervised learning is the problem of learning
an underlying input-output relation
based on input-output samples.
Once the underlying relation can be successfully learned,
answers for unlearned questions can be predicted.
Thus, the learning machine can generalize to unexperienced situations.
The goal of supervised learning research is to
let the learning machine acquire the best generalization
performance from the smallest possible number of training samples.
The supervised learning problem can be formulated as a function approximation
problem from samples.
Unsupervised learning is literally learning without supervisors;
only input data without output data (answer) are given.
The goal of unsupervised learning depends on situations
and the unsupervised learning problem is sometimes not mathematically well-defined.
Data clustering aiming at grouping similar data is a typical example.
In data clustering, how to measure the affinity between data samples
needs to be predetermined,
but there is no objective criterion that can quantitatively evaluate the
validity of the affinity measure; often it is merely subjectively determined.
Input-output samples are given in supervised learning
and input-only samples are given in unsupervised learning.
Semi-supervised learning is in the middle of these two situations---in
addition to the input-output samples, input-only samples
are also provided.
The goal of semi-supervised learning is the same as supervised
learning, i.e., to let the learning machine acquire
the best generalization performance.
In semi-supervised learning, the number of input-output samples
is typically small, while the number of input-only samples
is very large.
In such situations, the generalization
performance may be further improve by utilizing
a large number of input-only samples in addition
to a small number of input-output samples.
Reinforcement learning is the problem of learning a policy function
(a mapping from a state to an action) of an agent.
The policy function is an input-output relation
so the goal of reinforcement learning is the same as supervised learning.
However, different from supervised learning, the output data can not be directly observed.
Therefore, the policy function needs to be learned without supervisors;
but different from unsupervised learning, a reward is given for an agent's action.
The goal of reinforcement learning is, based on the information of the reward,
to learn the policy function such that the sum of rewards the agent will receive
in the future is maximized.
A typical approach to reinforcement learning does not directly learn the policy function
itself; it learns the value function, a mapping from a state (and an action)
to the sum of rewards.
The desired policy function can be generated based on the value function.
(book in Japanese: Training Robotic Game Players by Reinforcement Learning), (book: coming soon).
In order to acquire a higher level of generalization capability
in supervised learning,
it is important to determine the complexity of learning machines (models) appropriately.
If the model is too simple, it is not flexible enough to represent the learning target function
and therefore the generalization performance can not be improved even if a large amount of
training data is employed for learning.
On the other hand, if the model is too complex, it is capable of representing the target function,
but such a complex model is heavily affected by the noise in the training data.
Therefore, in practical situations with a rather small number of training samples,
complex models do not give better generalization performance.
Model selection is one of the most fundamental and central research topics
in supervised learning.
Model selection is generally carried out
by finding the model such that the generalization performance is maximized.
Therefore, the key of model selection research is how to derive an
accurate estimator of the generalization performance.
We tackled this model selection problem by a functional analytic approach
and developed an estimator of the generalization error called the
subspace information criterion (SIC)
(Neural Computation, 2001).
SIC has shown to work well because of its data-dependence;
thus SIC tends to outperform existing methods in single trials
(Machine Learning, 2002).
SIC was originally proposed as a criterion for selecting subspace models.
Later, we have extended SIC to be able to choose
the regularization parameter in linear ridge regression
(Neural Networks, 2002),
the regularization parameter in sparse linear regression
(IEEE Transactions on Neural Networks, 2002),
and
the regularization parameter in kernel ridge regression
(Journal of Machine Learning Research, 2002).
Furthermore, we theoretically evaluated
the performance of SIC
when the training input points also contain noise
(Neural Information Processing - Letters and Reviews, 2004),
and extended the range of application of SIC so that non-linear parameter learning methods are
also allowed (IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2007).
For the above research works,
the Young Researcher Award was given
from
Japanese Neural Network Society (JNNS)
in 2001,
the Information Technology Award
from
Funai Foundation for Information Technology
in 2003,
the Nakamura Research Award
from
Tejima Foundation for Industry and Education
in 2003.
Including SIC, the performance of model selection criteria is
usually evaluated by their unbiasedness.
However, unbiasedness doe not necessarily imply good model selection performance;
it is important to take the variance of the model selection criteria into account.
Based on this idea, we proposed the
corrected SIC (cSIC),
which is more accurate than the original SIC in the mean squared error sense
(IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2003).
Furthermore, we developed the
regularized SIC (RSIC)
which can more significantly reduces the variance of SIC
by slightly sacrificing its unbiasedness
(Neural Computation, 2004).
RSIC remarkably improves the stability and reliability of model selection.
We also proposed methods of analytically obtaining
the best model under RSIC that can improve the model selection accuracy and computational
efficiency at the same time;
for shrinkage learning
(IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2006)
and
for adaptive ridge learning (IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2007).
RSIC includes a regularization parameter in the model selection criterion,
which is optimized based on a meta model selection criterion.
We improved the meta criterion and showed
that the model selection performance is further improved
(IEICE Transactions on Information and Systems, 2007).
If training input location (question) is designed appropriately
in supervised learning,
one can obtain better generalization capability with a small amount of training data.
The problem of designing training input location is called
active learning (or experimental design).
The challenge of active learning is to predict the generalization performance
of a learning machine before training data is gathered.
Therefore, active learning is thought to be a more difficult problem than model selection;
particularly how to cope with the bias of the estimator,
which can not be generally predicted without
accessing to training data, is the key issue in active learning.
Standard active learning methods design the input location so that
the variance of a learning machine is minimized
under the assumption that the model at hand is correctly specified
(i.e., the learning target function is included in the model).
We showed that this standard approach actually consists of two stages:
the first stage corresponds to implicitly reducing the bias,
while the second stage is to minimize the variance.
However, the existing approach only explicitly solves the second problem---we
found that, the variance actually increases in the first stage.
To cope with this problem,
we proposed a two-stage incremental active learning method
(Neural Computation, 2000).
This method improves the performance
by explicitly suppressing the increase of the variance in the first stage
with guaranteeing that the bias vanishes.
The thesis that contains this work
was selected for the Doctor Dissertation Award
by
Tejima Foundation for Industry and Education in 2002.
The above two-stage method is practically useful.
However, it is an incremental method, i.e., the input location
is determined one by one in a sequential manner.
Therefore, it can only achieve the greedy optimality.
Theoretically, batch active learning---designing all input locations at the same time---is
globally optimal.
Following this fact, we developed
a globally optimal batch active learning method
when the learning target function is included in trigonometric polynomial models,
and elucidated its theoretical properties and geometric structure
(IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2001).
For this achievement, we received the Young Engineer Award from
The Institute of Electronics, Information and Communication Engineers (IEICE)
in 2002.
In the above methods, the model at hand is assumed to be correct.
However, this assumption may not be satisfied in practice.
Therefore, developing an active learning method
applicable to misspecified models
(i.e., the learning target function is not included in the model)
is practically very important--but is known to be a hard problem
since it is difficult to control the bias of misspecified models.
For misspecified models, importance-weighted least-squares learning
is proved to give an asymptotic unbiased estimator.
Based on this fact, we proposed a batch active learning method called
ALICE
(Journal of Machine Learning Research, 2006).
ALICE is guaranteed
to give a minimum-variance design of input locations
under the constraint that the bias is asymptotically minimized.
ALICE has been applied to efficient design of sampling policies in
robot control.
ALICE is an effective active learning method
when we are allowed to set the input points at any desired location.
In some practical situations, however, we are only given a finite number of input points
and have to choose the best input points among the candidate points.
Such a scenario is called pool-based active learning.
We extended ALICE to pool-based ALICE (P-ALICE)
and showed its usefulness through extensive simulations
(Machine Learning, 2009).
PALICE has been applied to semi-conductor wafer alignment in in semiconductor exposure apparatus.
Active learning and model selection have been investigated separately so far,
although they share a common goal of improving the generalization performance.
If the model and training input location are optimized simultaneously,
the generalization capability could be further improved.
However, simply combining existing active learning methods and
existing model selection methods is not possible for carrying out
active learning and model selection at the same time;
the model must have been fixed in the existing active learning methods and
the training input location must have been determined
and corresponding training output values must have been observed
in the existing model selection methods.
We call this phenomenon the model selection/active learning dilemma.
For this challenging problem,
we proved that there exist the common optimal training input location
for trigonometric polynomial models.
Based on this result, we developed
an algorithm for active learning with model selection
for trigonometric polynomial models
that can successfully avoid the model selection/active learning dilemma
(IEICE Transactions on Information and Systems, 2003).
Furthermore, we extended the above method to accommodate a general class of models
and developed the method of ensemble active learning
(Neural Networks, 2009).
When training samples are sequentially given in
supervised learning,
it is computationally efficient to build the learning result
upon the previously learned results.
Learning in such an incremental manner is called incremental learning (or on-line learning).
We developed an incremental learning method called
incremental projection learning (IPL),
that only relies on incremental computation but
provides exactly the same learning result as that
obtained by batch learning with all training samples
(Neural Networks, 2001a).
We also elucidated theoretical properties of IPL
(Neural Networks, 2001b).
IPL forms a basis of developing
incremental active learning algorithms.
Furthermore, we proposed
an IPL-based method of constructing neural networks
(IEICE Transactions on Information and Systems, 2002).
For the above series of works on incremental learning,
the Inose Science Promotion Award was given
by the Foundation of Electrical, Electronics, and Information Science Promotion
in 1998.
Online learning is also important in the context of
density ratio estimation.
We developed an online version of the density ratio estimation algorithm
KLIEP, and
showed its effectiveness through change-point detection experiments
(SDM2009).
When the dimensionality of the input data is high,
it becomes extremely difficult to learn well from the data
(known as `the curse of dimensionality').
This difficulty could be alleviated if the dimensionality of the data is reduced.
The goal of dimensionality reduction is to reduce
the dimensionality of the data
without losing the intrinsic information contained in the data.
If the low-dimensional components are chosen as a subset of original features,
the dimensionality reduction task is called feature selection.
On the other hand, if the low-dimensional components are chosen as
(possibly non-linear) combination of original features,
it is referred to as feature extraction.
Feature selection is preferred in scientific applications
such as biology and chemistry
since interpretability of learned results is more important
than predictability.
On the other hand, in general engineering applications
such as pattern recognition,
predictability is important so feature extraction would be more useful.
The goal of supervised dimensionality reduction is to find a subset of input variables or a subspace of the input space which explains the output values the most.
Discriminant Analysis
A typical method of supervised dimensionality reduction
in classification scenarios
would be Fisher discriminant analysis (FDA).
When samples in each category follows the normal distribution with common covariance structure,
FDA provides the optimal solution.
However, samples in some category has separate clusters (i.e., multi-modal),
FDA does not give desired solutions.
To cope with this problem, we proposed
local Fisher discriminant analysis (LFDA)
which works well with multi-modal data
(Journal of Machine Learning Research, 2007),
(software).
For this work,
the Incentive Award was given by
Japanese Society for Artificial Intelligence (JSAI)
in 2007.
Sufficient Dimension Reduction
Sufficient dimension reduction is an approach of supervised dimensionality reduction
for both regression and classification, aimed at
finding an input subspace which is independent of outputs.
Based on a method of density-ratio estimation
and developed a sufficient dimension reduction method (AISTATS2010).
In semi-supervised learning
settings where input-only samples are given
in addition to input-output samples,
we expect to obtain better dimensionality reduction performance
by utilizing the input-only samples.
For semi-supervised classification scenarios,
we developed semi-supervised local Fisher discriminant analysis (SELF)
by combining principal component analysis (unsupervised dimensionality reduction)
and local Fisher discriminant analysis (supervised dimensionality reduction)
(Machine Learning, 2010), (software).
Furthermore, for very high-dimensional sparse massive datasets,
we developed a computational efficient implementation called sparse SELF (SSELF),
and showed its usefulness in document classification,
an important task in natural language processing
(IEICE Transactions on Information and Systems, 2009), (software).
Dimensionality reduction is also very important
in unsupervised learning scenarios.
The goal of unsupervised dimensionality reduction is
to obtain a low dimensional representation of the data
while preserving the intrinsic information contained in the original data.
However, this problem is not mathematically well defined since the meaning of
intrinsic information is rather vague.
Therefore, it is a cumbersome problem.
Non-Gaussian Component Analysis
We focused on the situation
where the original data can be decomposed into
signal components included in a low dimensional linear subspace
and noise components
since this enables us to rigorously formulate
the unsupervised dimensionality reduction problem.
If the signal components follow some non-Gaussian distribution
while the noise components follow the Gaussian distributions,
the projection pursuit (PP) algorithm can identify the signal components.
PP needs to prefix a single criterion called projection index
that measures the non-Gaussianity of the data.
However, it is known that a standard projection index
is suitable for searching super-Gaussian distributions
while another typical projection index is effective in
finding sub-Gaussian distributions.
Therefore, if the data contains both super- and sub-Gaussian signal components,
it is not obvious how to optimally choose the projection index.
To cope with this problem, we proposed
a general semi-parametric framework
of unsupervised dimensionality reduction called
non-Gaussian component analysis (NGCA).
Within the NGCA framework,
we proposed an algorithm called
multi-index projection pursuit (MIPP)
(Journal of Machine Learning Research, 2006).
MIPP allows us to employ a number of projection indices at the same time.
Therefore, one can reduce the dimensionality of the data even when
it contains both super- and sub-Gaussian components.
Although MIPP can overcome the traditional weakness of PP,
it is not still the optimal method within the NGCA framework.
We further developed an algorithm called the
iterative metric adaptation for radial kernel functions (IMAK)
(Annals of the Institute of Statistical Mathematics, 2007).
IMAK can estimate the signal subspace more accurately than MIPP and PP.
We also applied the NGCA method to the noise reduction problem
and developed an algorithm for
accurately estimating the best linear unbiased estimator
(IEICE Transactions on Information and Systems, 2008).
Direct Density ratio estimation with Dimensionality Reduction
Dimensionality reduction is also an important topic in the context of importance estimation / density ratio estimation.
We proposed a framework called Direct Density ratio estimation with Dimensionality Reduction (D3; D-cube) for finding the subspace in which the numerator and denominator densities in the density ratio are different.
We developed a computationally efficient D3 algorithm which combines
Local Fisher Discriminant Analysis (LFDA) (Journal of Machine Learning Research, 2007)
and unconstrained Least-Squares Importance Fitting (uLSIF)(Journal of Machine Learning Research, 2009).
Furthermore, we proposed an algorithm called the Least-squares Hetero-distributional Subspace Search (LHSS) which can identify the subspace in which the discrepancy between the numerator and denominator densities is maximized (Neural Networks, 2011).
Input samples are directly used for learning
in standard supervised learning,
but it is possible to use similarities among input sample points.
The matrix that contains all pairwise similarities
between input sample points is called the similarity matrix.
When the similarity matrix is defined through positive semi-definite kernels,
it is called the kernel matrix and learning algorihtms using kernel matirces
are commonly referred to as kernel methods.
Any linear algorithms based on inner products
can be non-linearized without sacrificing computational simplicity
of linear algorithms by replacing the inner products with kernel functions.
Applying this technique, we can non-linearize our proposed algorithms:
LFDA, SELF, SSELF(dimensionality reduction>),
KLIEP, LSIF, uLSIF(density-ratio estimation).
The computational complexity of kernel methods generally depends
cubically on the number of training samples,
and therefore it is difficult to apply the kernel methods to large-scale datasets.
For kernel partial least-squares,
we proposed an approximation method
which allows us to compute the degrees-of-freedom
and error bars efficiently
(AISTATS2009).
The nu support vector machine is a powerful kernel-based classificaiton algorithm.
However, under strict formulation, the nu support vector machine
involves a non-convex optimization problem and
there is no known method for obtaining its global optimal solution.
Based on cutting plane mehtods, we developed a new optimization algorithm
which is guaranteed to find the global optimal solution
(New Generation Computing, 2009).
We also published a review paper on recent advances in large-scale kernel methods
(IEICE Transactions on Information and Systems, 2009).
Sparse learning methods are incresingly popular these days
due to its high generalization performance with a large number of
explanatory variables.
We developed an augmented Lagrangian algorithm for efficient
training of sparse regularization models
(IEEE Signal Processing Letters, 2009)
Kernel least-squares fitting is not robust against outliers.
We applied the kernelized l_1 fitting method to
robot control
and showed its usefulness (ICRA2009).
In order to obtain a better generalization performance,
feature selection of input samples is important.
However, it is often difficult to explicitly extract good features
from input samples.
In such cases, a good generalization performance could be obtained
if feature selection is implicitly carried out
by defining similarities between input samples,
rather than explicitly selecting features.
Gaussian kernels are general-purpose similarity functions
and give good results on the whole.
However, if some prior knowledge of the problem domain is available,
the generalization performance and computational efficiency
could be improved by utilizing the prior knowledge.
We proposed the principal component kernel (PCK)
which can be computed based on the prior distribution of the learning target function
(IEICE Transactions on Information and Systems, 2006).
We constructed a specific PC kernel for the binary regression problem
and showed its effectiveness.
Kernel functions are also often employed in value function approximation
in reinforcement learning.
We proposed the geodesic Gaussian kernel (GGK),
which can be constructed based on the manifold structure of the state space.
We demonstrated the effectiveness of GGKs in robot arm control and
Khepera robot navigation tasks
(Autonomous Robots, 2008),
(demo).
Standard kernel methods deal with a single similarity matrix,
but we are sometimes given multiple similarity matrices.
In such cases, the generalization performance could be improved
if the information brought by multiple similarity matrices are combined.
We proposed a new method that can
learn from multiple similarity matrices.
The proposed method is robust in the sense that
noisy similarity matrices are automatically de-emphasized
(IEEE Transactions on Neural Networks, 2009).
In some supervised learning situations,
input samples themselves are not provided, but only the similarity
between input points can be observed.
We can use kernel methods if the similarity matrix is positive semi-definite,
but the similarity matrices stemmed from real problems do not necessarily
posses positive semi-definiteness.
To cope with this problem, we derived a necessary condition
of similarity matrices for achieving a better generalization performance
and proposed a boosting-type algorithm based on the theory
(Neural Computation, 2009).
In standard supervised learning,
it is commonly assumed that the samples used for training
follows the same probability distribution as the test samples.
However, this assumption is not always satisfied in practice.
For example, when the data generation mechanism is non-stationary,
training and test distributions may be different.
In such situations, just using training samples as they are
does not provide a better generalization performance
due to the distribution difference.
We co-organized
the NIPS2006 Workshop
on this issue and edited a book
(Dataset Shift in Machine Learning).
Furthermore, we published a monograph on covariate shift adaptation
(book: coming soon).
The situation where
the input distribution is different in training and test phases
but the functional relation remains unchanged
is called the covariate shift.
Note that the covariate is the name of the input variable in statistics.
A typical example of covariate shift scenarios
would be an extrapolation problem where output values
at a region with only a few training samples are learned.
When active learning is
carried out, covariate shift is naturally induced.
The problem of covariate shift adaptation is to learn the target function
(which is unchanged) given input-output training samples and input-only
test samples
(The Brain & Neural Networks, 2006, in Japanese),
(Image Lab, 2007, in Japanese),
(article in a book).
The unbiasedness of model selection criteria including the subspace information criterion (SIC)
is no longer maintained under the covariate shift.
To cope with this problem, we proposed an extended SIC
called importance weighted SIC (IWSIC)
that is still unbiased even under the covariate shift
(Statistics & Decisions, 2005).
IWSIC is specialized for the squared loss function, so is not applicable
to classification tasks.
Cross validation is a standard model selection technique in classification scenarios,
but its unbiasedness is lost under the covariate shift.
We therefore developed an extended cross validation method called
importance weighted cross validation (IWCV)
that works desirably in covariate shifted classification tasks
(Journal of Machine Learning Research, 2007).
Furthermore, we theoretically elucidated the effect of covariate shift
on Baysian inference with singular models
(ICML2007).
The effectiveness of covariate shift adaptation methods has been
demonstrated in various real-world problems such as
brian-computer interface,
robot control,
speech processing,
natural language processing,
and
computer vision.
For the above series of works on covariate shift adaptation,
the IBM Faculty Award
is given by in 2007.
Domain adaptation / transfer learning deals with the situation where the functional relations
also changes between training and test phases
in addition to the input distributions.
Without any restrictions on the distribution change,
the test function is nothing related to the
training samples and therefore we may not be able to learn anything
about the test function from the training samples.
Thus, in order to make a meaningful discussion,
we need a reasonable restriction on the distribution change.
We consider a restriction where
the training input-output distribution is a mixture
of two distributions
and the test input-output distribution is one of the two distributions.
Under this restriction,
we developed a learning method
that can learn the test function
given input-output training samples and input-only test samples
(NIPS2006).
We developed a transfer learning method which enforces the target solution to be similar to related source solutions,
and showed its effectiveness in biological network inference (International Journal of Knowledge Discovery in Bioinformatics, 2010).
When we have several learning tasks,
solving them together could be better than
solving them individually since
common information could be shared by several related tasks.
This could be interpreted as performing domain adaptation over
several tasks.
Under the setting that (a small number of) input-output samples are given
for every tasks,
we developed a method to simultaneously train multiple support vector machines
by sharing common information
(IEEE Transactions on Knowledge and Data Engineering, 2010).
The importance is the ratio of two probability density functions.
The problem of importance estimation is to
estimate the importance from two sets of samples from
drawn from two probability distributions
(book: coming soon).
The values of the importance could be simply estimated
by estimating the probability density functions
and then taking the ratio of the estimated densities.
However, it is hard to estimate the density functions accurately
and therefore such a naive approach is not suitable.
We proposed a method called Kullback-Leibler importance estimation procedure (KLIEP),
which directly models the importance function
and does not involve density estimation
(Annals of the Institute of Statistical Mathematics, 2008), (software).
Then we developed extended KLIEP methods, log-linear KLIEP (LL-KLIEP)
(Journal of Information Processing, 2009),
Gaussian-mixture KLIEP (GM-KLIEP)
(IEICE Transactions on Information and Systems, 2009),
and
Probabilistic-PCA-mixture KLIEP (PM-KLIEP)
(IEICE Transactions on Information and Systems, 2010).
We also developed importance estimation methods based on the squaredloss,
least-squares importance fitting (LSIF) and unconstrained LSIF (uLSIF)
(Journal of Machine Learning Research, 2009),
(software).
In LSIF, the regularization path can be computed efficiently,
so the computation time including model selection can be significantly reduced.
In uLSIF, an importance estimator as well as the leave-one-out score can be calculated analytically,
so it is computationally very efficient.
We have also proved that uLSIF has an excellent stability property in terms of condition number analysis (coming soon).
For accurately estimating the density ratio in high-dimensional spaces,
dimensionality reduction is critical.
We named the subspace in which the numerator and denominator densities in the density ratio are different the hetero-distributional subspace,
and proposed a new framework for estimating the density ratio with the hetero-distributinal subspace search called the Direct Density ratio estimation with Dimensionality Reduction (D3; D-cube).
We developed a computationally efficient D3 algorithm which combines
Local Fisher Discriminant Analysis (LFDA) (Journal of Machine Learning Research, 2007)
and unconstrained Least-Squares Importance Fitting (uLSIF)(Journal of Machine Learning Research, 2009).
Furthermore, we proposed an algorithm called the Least-squares Hetero-distributional Subspace Search (LHSS), which can identify the subspace in which the discrepancy between the numerator and denominator densities is maximized (SDM2010).
Density ratio estimation has various applications
such as
non-stationality adaptation,
outlier detection,
mutual information estimation,
and
conditional probability estimation.
See the review paper for details
(IPSJ Transactions, 2009),
(RIMS Kokyuroku, 2010),
(Proceedings of the Institute of Statistical Mathematics, 2010, in Japanese).
Mutual information is an important quantity in information theory
and is equivalent to the Kullback-Leibler divergence from a joint distribution
to the product of marginals.
Thus mutual information can be used for measuring statistical independence among random variables.
Based on density ratio estimators,
we proposed mutual information estimators
called maximum likelihood mutual information (MLMI)
(FSMI2008),
(software)
and its squared-loss variant
called least-squares mutual information (LSMI)
(BMC Bioinformatice, 2009),
(software).
Mutual information estimators can be used for various purposes such as
dimensionality reduction,
independent component analysis,
and causal inference.
Outlier detection is an
unsupervised learning problem
to find `irregular' samples in a data set.
Within a statistical formulation, a sample is regarded as an outlier
if its probability density value is smaller than a threshold.
Thus, if the probability density function of the data samples is identified,
the statistical outlier problem can be completely solved.
However, density estimation is known to be a hard task
particularly in high dimensional problems
and therefore we want to avoid density estimation in outlier detection.
We consider an outlier detection problem
where we want to identify outliers in a test data set based on
the training data set that do not contain outliers.
For this problem, we proposed an outlier detection method
that uses the importance as a novelty score.
Since the importance values could be obtained
without going through density estimation,
the proposed method works very well even in high dimensional problems
(Knowledge and Information Systems, 2011),
(software1),
(software2).
Independent component analysis (ICA) is a problem of
separating mixed signals into statistically independent ones,
and has been extensively studied in, e.g., speeach signal processing.
In the ICA research, how to evaluate statistical independence among signals is a key issue.
We proposed an ICA algorithm, called least-squares independent component analysis (LICA),
which is based on a mutual information approximator
using a density ratio estimator
(Neural Computation, 2011).
Given a paired random variable (X,Y),
it is very important to test whether X results in Y, Y results in X, or there is no causal relation between X and Y.
We proposed a causal inference algorithm, called least-squares independence regression (LSIR),
which is based on a mutual information approximator
using a density ratio estimator
(AAAI2010).
Estimating the conditional density when the condition variable is continuous
is not a straightforward task.
We proposed a method of estimating conditional densities
called the least-squares conditional density estimator (LSCDE)
(IEICE Transactions on Information and Systems, 2010)
which is useful especially for multi-dimensional continous variables.
LSCDE is based on a density ratio estimator.
A similar idea can be applied to probabilitic pattern recognition,
where the class-posterior probability is estimated in the classification scenario.
Our density-ratio based method, called least-squares probabilitic classifier (LSPC),
is faster than alternative approaches in training time by two orders of magnitude,
with pattern recognition accuracy
(IEICE Transactions on Information and Systems, 2010),
(software).
Standard supervised learning methods such as least-squares
learn the function such that the mean error for training samples is minimized.
This guarantees the mean generalization performance of learning machines,
but this does not necessarily mean that
the generalization performance for the single dataset at hand is high.
To appropriately handle such a risk,
it is necessary to design learning algorithm so that
not only the mean error, but also the error distribution is taken into account.
We proved that the boosting algorithm control the error distribution
properly and gave a new theoretical explanation for its success (COLT2008).
Furthermore, we proved that the nu-support vector machine optimizes an error measure called the conditional value-at-risk (CVaR) and elucidated its generalization performance theoretically (New Generation Computing, 2009).
Risk controls is also important in the framework of reinforcement learning for sequential decision making.
We developed reinforcement learning algorithms for optimizing various risk measures (IEICE Transactions on Information and Systems, 2010).
Furthermore, we proposed a new framework for estimating the sum of rewards obtained in the future and developed various algorithms for risk management (UAI2010), (ICML2010).
When a matrix is noisy or has missing values,
approximating the target matrix by a low-rank one
can give de-noised or missing-imputed matrix.
Since approximating the target matrix with a low-rank constraint
results in a non-convex optimization problem,
obtaining a good solution is difficult in practice.
To avoid this problem, the trace-norm regularizer is useful.
The trace-norm regularizer is a convex function,
and works as the l1-regularizer on singular values.
Thus, it tends to produce sparse singular values and thus a low-rank matrix
via convex optimization.
However, optimization with trace-norm regularizers is still challenging
for large-scale problems due to its non-differentiability.
To cope with this problem,
we proposed an efficient optimization algorithm for trace-norm regularization
based on dual augmented Lagrangian
(ICML2010).
Another popular formulation of low-rank approximation is matrix factorization:
a matrix is decomposed into the product of two `thin' matrices.
The matrix factorization model is non-identifiable,
meaning that the decomposition is redundant.
We showed that, in the variational Bayesian framework,
the non-identifiability of the matrix factorization model
induces a regularization effect even when the prior distributions are flat,
which is a notably different behavior from identifiable models
(ICML2010).
The solution of variational Bayesian matrix factorization is usually
computed by an iterative algorithm.
Although the optimization problem is non-convex,
we showed that the global optimal solution can be computed analytically
(NIPS2010).
We applied the subspace information criterion (SIC) in precipitation estimation
and won the first prize in prediction accuracy
at
2001 IEICE
Precipitation Estimation Contest
(Proceedings of the 2001 IEICE General Conference, 2001).
The problem of restoring degraded images by noise or blur
is mathematically equivalent to
supervised learning.
We formulated the problem of optimizing image restoration filters as
a model selection problem and proposed
a filter design method based on SIC
(IEICE Transactions on Information and Systems, 2001).
The paper received high praise because of its originality and usefulness
and we received the
Niwa Memorial Award
from Tokyo Denki University in 2002.
We further developed an integrated filter design method
(Signal Processing, 2002).
In image restoration, it is sometimes not sufficient to just obtain subjectively clear images;
for example, when the face recorded by a security camera is restored,
we should not add any subjective artifact in the image.
In such scenarios, it is effective to model the degradation process
and restore the image following the modeled process.
Based on this idea, we restored the image of
a historically important camera and contributed to
elucidating the Japanese camera history
(Technical Report, 2004).
Brain-Computer Interface (BCI) is a user interface that allows to control computers
with brain signals.
BCI allows paralyzed patients to use computers comfortably.
Even ordinary people can have a comfortable computer environment with the help
of brain signals.
Because of the time resolution, EEG signals are often used for BCI.
Namely, EEG signals which contain users' intention (e.g., left or right)
are classified into different control commands
by a supervised learning technique.
However, the non-stationarity of the brain causes a significant drift
in the EEG signals between training and test phases
so standard supervised learning methods
do not work appropriately.
We modeled the non-stationarity of the brain
as the covariate shift
and applied IWCV.
Then the BCI recognition accuracy has been highly improved
(Journal of Machine Learning Research, 2007),
(IEEE Transactions on Biomedical Engineering, 2010).
Optical interference microscopes allow to measure surface profile
at the scale of nanometers.
The surface profiling by optical interference microscopes
can be regarded as an example of
supervised learning.
Most of the existing surface profiling algorithms need to measure the interference
level multiple times with the phase of the light changed.
Surface profilers are often used in an environment with vibration such as factories.
In such environment, the measurement accuracy
degrades significantly if measurement is carried out multiple times.
To cope with this problem, we developed an efficient and accurate method
called local model fitting (LMF)
that allows us to recover the surface profile only from a single measurement
(Applied Optics, 2006).
We further provided theoretical error analysis of the LMF method
and proposed an improved method, interpolated LMF (iLMF)
(Applied Optics, 2009).
We also developed the iteratively-reweighted LMF (IRLMF),
which can adaptively determine the local regions in LMF
(Applied Optics, 2010).
Surface profiling algorithms based on monochromatic light sources,
including the above LMF method, generally have
a strong limitation on the maximum height of steps to be measured
due to the phase periodicity.
The maximum height can be extended by using multiple monochromatic light
sources with different wave length;
however, this requires multiple measurements and thus the advantage of
one-shot methods such as LMF is spoiled.
We developed a method that enables us to achieve this with a single measurement
(Journal of the Japan Society for Precision Engineering, 2009, in Japanese).
For this work, we won the Odawara Award 2nd Prize at the
Vision Engineering Workshop (ViEW2007) in 2007.
When the target object is covered by a film layer,
existing surface profiling algorithms do not give accurate results
because of the refraction; furthermore, the thickness of the film can not be measured.
To cope with this problem, we developed an algorithm that can
simultaneously measure the surface profile and the thickness
of the film:
white light
(Transactions of the Society of Instrument and Control Engineers, 2007, in Japanese),
multiple-shot monochromatic light
(Transactions of the Society of Instrument and Control Engineers, 2009, in Japanese),
and
one-shot monochromatic light
(Journal of the Japan Society for Precision Engineering, 2009, in Japanese).
Appropriately Designing basis functions in value function approximation
is an important problem of reinforcement learning.
We applied the geodesic Gaussian kernel (GGK)
to Khepera robot navigation tasks
and showed that the proposed method provides faster learning
than conventional method
(Autonomous Robots, 2008),
(demo).
We further showed that an update of robot control policies
through online learning processes could be formulated as
covariate shift.
We applied the importance weighted cross validation (IWCV) method
in value function approximation and proposed sample-reuse policy iteration (SRPI)
(Neural Networks, 2009).
We further applied a similar idea to a direct policy learning method
and proposed reward-weighted regression with sample reuse (R^3)
(ECML-PKDD2009).
Collecting data samples is often costly in real-world robotics
and therefore designing good sampling policy is highly important.
We applied our active learning method,
ALICE, to sampling policy design in robot control and proposed
active policy iteration (API)
(Neural Networks, 2010).
In practical robot-cotrol problemss, reward samples are often measured by sensors
and thus contain outliers.
We applied a robust regression technique to policy iteration
and proposed least absolute policy iteration (LAPI).
(IEICE Transactions on Information and Systems, 2010)
(demo).
Recent semi-conductors have the layered circuit structure, which are built by
exposing circuit patterns multiple times.
In this process, it is extremely important to align
the wafer at the same position with very high accuracy.
To this end, the location of markers
are measured to adjust the shift and rotation of wafers.
However, measuring the location of markers is time-consuming
and therefore there is a strong need to reduce the number of markers
to measure for speeding up the semi-conductor production process.
We applied our active learning method, PALICE, to the wafer alignment problem
and showed its usefulness
(Machine Learning, 2009).
The goal of speaker identification is
to predict who the speaker is from speech signals.
In speaker identification, non-stationarity caused by
changing environment and emotion is often a source of performance degradation.
We formulated speaker identification with non-stationarity
as classification under covariate shift,
and proposed a variant of logistic regression for non-stationarity adaptation
(Signal Processing, 2010).
The problem of word segmentation of Japanese sentences is
a fundamental task of natural language processing.
While Labeled samples for general conversation corpora are readily available,
data in specialized domains such as medical science are not easily available.
We applied the covariate shift adaptation techniques
and showed that the word segmentation accuracy of medical documents
can be improved by adapting the general conversation data to the medical domain
(Journal of Information Processing, 2009).
Document classification involves super-high dimensional sparse data,
so dimensionality reduction is essential.
We applied the semi-supervised dimensionality reduction method, SELF, to document classification,
and showd that the classification accuracy is highly improved
(IEICE Transactions on Information and Systems, 2009),
(Machine Learning, 2010),
(software).
We developed an efficient thesaurus construction method based on
unsupervised dimensionality reduction
(IEICE Transactions on Information and Systems, 2010).
Predicting age from face images has attracted
a great deal of attention these days
since it is useful for designing effective marketing strategies.
However, manually labeling face images is laborious,
so we want to utilize unlabeled image data
(i.e., semi-supervised learning).
We developed an active learning
strategy based on the cluster structure of unlabeled samples
and a regression method based on the characteristic
of human perception (IEICE Transactions on Information and Systems, 2010).
Face images possesses high diversity due to angles and lighting conditions,
so training and test images often follow different distributions.
We developed a new age prediction method based on the covariate shift adaptation techniques
(ICPR2010).
Sugiyama Lab.,
Department of Computer Science,
Graduate School of Information Science and Engineering
Tokyo Institute of Technology
2-12-1-W8-74, O-okayama, Meguro-ku, Tokyo, 152-8552, Japan.
TEL & FAX: +81-3-5734-2699