- Types of Learning
- Theories and Algorithms of Machine Learning
- Model Selection
- Active Learning
- Incremental Learning / Online Learning
- Dimensionality Reduction
- Learning from Similarity Data / Kernel Methods
- Learning under Different Distributions
- Importance Estimation / Density Ratio Estimation
- Mutual Information Estimation / Divergence Estimation
- Outlier Detection
- Independent Component Analysis
- Causal Inference
- Conditional Probability Estimation
- Risk-sensitive Learning
- Low-rank Approximation

- Applications of Machine Learning

In semi-supervised learning, the number of input-output samples is typically small, while the number of input-only samples is very large. In such situations, the generalization performance may be further improved by utilizing the large number of input-only samples in addition to the small number of input-output samples.

A typical approach to reinforcement learning does not learn the policy function directly; instead, it learns the value function, a mapping from a state (and an action) to the expected sum of future rewards. The desired policy function can then be derived from the value function. (book in Japanese: Training Robotic Game Players by Reinforcement Learning), (book: coming soon).
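The value-function approach can be sketched on a toy Markov decision process; the states, actions, transitions, and rewards below are all hypothetical stand-ins, and the policy is read off greedily from the learned value function:

```python
import numpy as np

# Toy 2-state, 2-action MDP (all quantities hypothetical).
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']: transition probabilities
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],                 # R[s, a]: expected immediate rewards
              [0.0, 2.0]])

# Value iteration: V(s) <- max_a [ R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s') ]
V = np.zeros(2)
for _ in range(500):
    Q = R + gamma * P @ V                 # state-action value function Q[s, a]
    V = Q.max(axis=1)

policy = Q.argmax(axis=1)                 # greedy policy derived from the value function
```

Once the value function has converged, the greedy policy is optimal for this toy MDP; function approximation replaces the table in realistic problems.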

Model selection is one of the most fundamental and central research topics in supervised learning. It is generally carried out by finding the model that maximizes the generalization performance. The key issue in model selection research is therefore how to derive an accurate estimator of the generalization performance.
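As a concrete (hypothetical) illustration, k-fold cross-validation is one standard way to estimate the generalization error of each candidate model; the model with the smallest estimate is then selected. The data-generating function and candidate degrees below are toy choices:

```python
import numpy as np

# Choose a polynomial degree by 5-fold cross-validation (toy data).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(60)

def cv_error(degree, k=5):
    """Estimate the generalization error of a degree-`degree` polynomial."""
    idx = np.arange(len(x))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        coef = np.polyfit(x[train], y[train], degree)   # fit on k-1 folds
        errs.append(np.mean((np.polyval(coef, x[fold]) - y[fold]) ** 2))
    return np.mean(errs)                                # averaged held-out error

best_degree = min(range(1, 8), key=cv_error)            # model with smallest estimate
```

Cross-validation is only one estimator of the generalization error; criteria such as SIC discussed below serve the same role analytically.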

The above two-stage method is practically useful. However, it is an incremental method, i.e., the input locations are determined one by one in a sequential manner, so it achieves only greedy optimality. Theoretically, batch active learning, in which all input locations are designed at the same time, is globally optimal. Following this fact, we developed a globally optimal batch active learning method for the case where the learning target function is included in a trigonometric polynomial model, and elucidated its theoretical properties and geometric structure (IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2001). For this achievement, we received the Young Engineer Award from The Institute of Electronics, Information and Communication Engineers (IEICE) in 2002.

In the above methods, the model at hand is assumed to be correct. However, this assumption may not be satisfied in practice. Therefore, developing an active learning method applicable to misspecified models (i.e., models that do not include the learning target function) is practically very important, but it is known to be a hard problem since the bias of misspecified models is difficult to control. For misspecified models, importance-weighted least-squares learning has been proved to give an asymptotically unbiased estimator. Based on this fact, we proposed a batch active learning method called ALICE (Journal of Machine Learning Research, 2006). ALICE is guaranteed to give a minimum-variance design of input locations under the constraint that the bias is asymptotically minimized. ALICE has been applied to the efficient design of sampling policies in robot control.
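The importance-weighting idea behind this line of work can be sketched as follows. In this toy setting the training and test input densities are known Gaussians, so the importance is available in closed form; the model, densities, and sample sizes are all hypothetical stand-ins:

```python
import numpy as np

# Importance-weighted least squares for a misspecified linear model under
# covariate shift (toy setting; in practice the importance must be estimated).
rng = np.random.default_rng(0)
x_tr = rng.normal(0.0, 1.0, 200)                       # training inputs ~ p_train
y_tr = np.sin(x_tr) + 0.1 * rng.standard_normal(200)   # target function is nonlinear

def gauss_pdf(t, mu, sigma):
    return np.exp(-(t - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# importance w(x) = p_test(x) / p_train(x), with hypothetical p_test = N(1, 0.5^2)
w = gauss_pdf(x_tr, 1.0, 0.5) / gauss_pdf(x_tr, 0.0, 1.0)

# weighted least squares for the misspecified model y = a*x + b
X = np.c_[x_tr, np.ones_like(x_tr)]
theta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y_tr))
a, b = theta
```

The weights emphasize training samples that fall in the test region, which is what removes the asymptotic bias of the misspecified fit there.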

ALICE is an effective active learning method when we are allowed to place the input points at any desired location. In some practical situations, however, we are given only a finite number of candidate input points and have to choose the best input points among them. Such a scenario is called pool-based active learning. We extended ALICE to pool-based ALICE (P-ALICE) and showed its usefulness through extensive simulations (Machine Learning, 2009). P-ALICE has been applied to wafer alignment in semiconductor exposure apparatus.

For this challenging problem, we proved that a common optimal training input location exists for trigonometric polynomial models. Based on this result, we developed an active learning algorithm with model selection for trigonometric polynomial models that successfully avoids the model selection/active learning dilemma (IEICE Transactions on Information and Systems, 2003).

Furthermore, we extended the above method to accommodate a general class of models and developed the method of ensemble active learning (Neural Networks, 2009).

We developed an incremental learning method called incremental projection learning (IPL), which relies only on incremental computation yet provides exactly the same learning result as batch learning with all training samples (Neural Networks, 2001a). We also elucidated theoretical properties of IPL (Neural Networks, 2001b). IPL forms a basis for developing incremental active learning algorithms. Furthermore, we proposed an IPL-based method for constructing neural networks (IEICE Transactions on Information and Systems, 2002). For this series of works on incremental learning, we received the Inose Science Promotion Award from the Foundation of Electrical, Electronics, and Information Science Promotion in 1998.
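The flavor of incremental learning that exactly reproduces the batch result can be illustrated with recursive least squares, a classical scheme used here only as an analogue (IPL itself differs in its details); the data below are hypothetical:

```python
import numpy as np

# Recursive least squares: each new sample updates the estimate without
# re-solving from scratch, yet the final result matches the batch solution
# (up to a vanishingly small regularization from the initialization).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.standard_normal(50)

theta = np.zeros(3)
P = 1e6 * np.eye(3)                  # "inverse covariance"; large = nearly flat prior
for x_i, y_i in zip(X, y):           # samples arrive one by one
    k = P @ x_i / (1.0 + x_i @ P @ x_i)        # gain vector
    theta = theta + k * (y_i - x_i @ theta)    # incremental parameter update
    P = P - np.outer(k, x_i @ P)               # incremental covariance update

theta_batch = np.linalg.lstsq(X, y, rcond=None)[0]   # batch least squares
```

After all 50 samples, `theta` agrees with `theta_batch` to high precision, which is the defining property of exact incremental learning.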

Online learning is also important in the context of density ratio estimation. We developed an online version of the density ratio estimation algorithm KLIEP, and showed its effectiveness through change-point detection experiments (SDM2009).

For semi-supervised classification scenarios, we developed semi-supervised local Fisher discriminant analysis (SELF) by combining principal component analysis (unsupervised dimensionality reduction) and local Fisher discriminant analysis (supervised dimensionality reduction) (Machine Learning, 2010), (software). Furthermore, for very high-dimensional sparse massive datasets, we developed a computationally efficient implementation called sparse SELF (SSELF), and showed its usefulness in document classification, an important task in natural language processing (IEICE Transactions on Information and Systems, 2009), (software).

To cope with this problem, we proposed a general semi-parametric framework of unsupervised dimensionality reduction called non-Gaussian component analysis (NGCA). Within the NGCA framework, we proposed an algorithm called multi-index projection pursuit (MIPP) (Journal of Machine Learning Research, 2006). MIPP allows us to employ a number of projection indices at the same time. Therefore, one can reduce the dimensionality of the data even when it contains both super- and sub-Gaussian components.

Although MIPP overcomes the traditional weakness of PP, it is still not optimal within the NGCA framework. We further developed an algorithm called iterative metric adaptation for radial kernel functions (IMAK) (Annals of the Institute of Statistical Mathematics, 2007). IMAK can estimate the signal subspace more accurately than MIPP and PP.

We also applied the NGCA method to the noise reduction problem and developed an algorithm for accurately estimating the best linear unbiased estimator (IEICE Transactions on Information and Systems, 2008).

The nu-support vector machine is a powerful kernel-based classification algorithm. However, in its strict formulation, the nu-support vector machine involves a non-convex optimization problem, and no method was known for obtaining its globally optimal solution. Based on cutting-plane methods, we developed a new optimization algorithm that is guaranteed to find the globally optimal solution (New Generation Computing, 2009).

We also published a review paper on recent advances in large-scale kernel methods (IEICE Transactions on Information and Systems, 2009).

Sparse learning methods are increasingly popular these days due to their high generalization performance even with a large number of explanatory variables. We developed an augmented Lagrangian algorithm for efficient training of sparse regularization models (IEEE Signal Processing Letters, 2009).

Gaussian kernels are general-purpose similarity functions and give good results on the whole. However, if some prior knowledge of the problem domain is available, the generalization performance and computational efficiency could be improved by utilizing the prior knowledge.

We proposed the principal component kernel (PCK) which can be computed based on the prior distribution of the learning target function (IEICE Transactions on Information and Systems, 2006). We constructed a specific PC kernel for the binary regression problem and showed its effectiveness.

Kernel functions are also often employed in value function approximation in reinforcement learning. We proposed the geodesic Gaussian kernel (GGK), which can be constructed based on the manifold structure of the state space. We demonstrated the effectiveness of GGKs in robot arm control and Khepera robot navigation tasks (Autonomous Robots, 2008), (demo).
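The geodesic construction can be sketched on a toy state graph: shortest-path (geodesic) distances replace Euclidean ones inside a Gaussian kernel. The graph and kernel width below are hypothetical:

```python
import numpy as np

# Geodesic Gaussian kernel on a 4-state chain graph 0-1-2-3 (toy example).
n = 4
D = np.full((n, n), np.inf)
np.fill_diagonal(D, 0.0)
for i in range(n - 1):               # unit-length edges along the chain
    D[i, i + 1] = D[i + 1, i] = 1.0

# Floyd-Warshall: all-pairs shortest-path (geodesic) distances on the graph
for k in range(n):
    D = np.minimum(D, D[:, [k]] + D[[k], :])

sigma = 1.0
K = np.exp(-D ** 2 / (2 * sigma ** 2))   # Gaussian of the geodesic distance
```

Because distances follow the graph rather than the ambient space, the kernel respects the manifold structure of the state space, which is the point of the GGK construction.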

We co-organized the NIPS2006 Workshop on this issue and edited a book (Dataset Shift in Machine Learning). Furthermore, we published a monograph on covariate shift adaptation (book: coming soon).

The unbiasedness of model selection criteria, including the subspace information criterion (SIC), is no longer maintained under covariate shift. To cope with this problem, we proposed an extended SIC called importance-weighted SIC (IWSIC) that remains unbiased even under covariate shift (Statistics & Decisions, 2005).

IWSIC is specialized to the squared loss function and is thus not applicable to classification tasks. Cross-validation is a standard model selection technique in classification scenarios, but its unbiasedness is lost under covariate shift. We therefore developed an extended cross-validation method called importance-weighted cross-validation (IWCV) that works desirably in classification tasks under covariate shift (Journal of Machine Learning Research, 2007).
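The importance-weighting idea can be sketched as follows: each held-out loss is weighted by w(x) = p_test(x)/p_train(x), so the cross-validation score keeps tracking the test error under covariate shift. All toy quantities below (densities, data, candidate models) are hypothetical, and for simplicity the sketch uses squared loss rather than a classification loss:

```python
import numpy as np

# Importance-weighted cross-validation (IWCV) for model selection under
# covariate shift; importance values are assumed given (here computed from
# known toy Gaussian densities).
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 100)                        # training inputs ~ p_train
y = np.sin(x) + 0.1 * rng.standard_normal(100)

def gauss_pdf(t, mu, sigma):
    return np.exp(-(t - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

w = gauss_pdf(x, 1.0, 0.5) / gauss_pdf(x, 0.0, 1.0)  # importance weights

def iwcv_score(degree, k=5):
    idx = np.arange(len(x))
    scores = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        coef = np.polyfit(x[train], y[train], degree)
        loss = (np.polyval(coef, x[fold]) - y[fold]) ** 2
        scores.append(np.average(loss, weights=w[fold]))  # weight held-out losses
    return np.mean(scores)

best_degree = min(range(1, 6), key=iwcv_score)
```

Setting all weights to 1 recovers ordinary cross-validation; the weighting is the only change needed for covariate-shift adaptation.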

Furthermore, we theoretically elucidated the effect of covariate shift on Bayesian inference with singular models (ICML2007).

The effectiveness of covariate shift adaptation methods has been demonstrated in various real-world problems such as brain-computer interfaces, robot control, speech processing, natural language processing, and computer vision.

For this series of works on covariate shift adaptation, we received the IBM Faculty Award in 2007.

We consider a restriction where the training input-output distribution is a mixture of two distributions and the test input-output distribution is one of the two distributions. Under this restriction, we developed a learning method that can learn the test function given input-output training samples and input-only test samples (NIPS2006).

We developed a transfer learning method which enforces the target solution to be similar to related source solutions, and showed its effectiveness in biological network inference (International Journal of Knowledge Discovery in Bioinformatics, 2010).

Under the setting that a (small) number of input-output samples are given for every task, we developed a method to simultaneously train multiple support vector machines by sharing common information (IEEE Transactions on Knowledge and Data Engineering, 2010).

The values of the importance could in principle be estimated by estimating the two probability density functions and taking their ratio. However, accurately estimating density functions is hard, so such a naive approach is not suitable. We proposed a method called the Kullback-Leibler importance estimation procedure (KLIEP), which directly models the importance function without density estimation (Annals of the Institute of Statistical Mathematics, 2008), (software). We then developed extended KLIEP methods: log-linear KLIEP (LL-KLIEP) (Journal of Information Processing, 2009), Gaussian-mixture KLIEP (GM-KLIEP) (IEICE Transactions on Information and Systems, 2009), and probabilistic-PCA-mixture KLIEP (PM-KLIEP) (IEICE Transactions on Information and Systems, 2010). We also developed importance estimation methods based on the squared loss: least-squares importance fitting (LSIF) and unconstrained LSIF (uLSIF) (Journal of Machine Learning Research, 2009), (software). In LSIF, the regularization path can be computed efficiently, so the computation time including model selection can be significantly reduced. In uLSIF, the importance estimator as well as the leave-one-out score can be calculated analytically, so it is computationally very efficient. We have also proved that uLSIF has an excellent stability property in terms of a condition number analysis (coming soon).
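A minimal uLSIF-style sketch: the density ratio is modeled as a linear combination of Gaussian kernels centered at test points, and the coefficients are obtained analytically by solving a regularized linear system. The kernel width and regularization constant below are hypothetical fixed choices (in practice they are selected by cross-validation):

```python
import numpy as np

# Direct density-ratio estimation in the style of uLSIF (toy setting).
rng = np.random.default_rng(0)
x_tr = rng.normal(0.0, 1.0, (300, 1))    # denominator samples ~ p_train
x_te = rng.normal(0.5, 0.8, (200, 1))    # numerator samples   ~ p_test

centers = x_te[:50]                      # kernel centers taken from test samples
sigma, lam = 0.5, 0.1                    # hypothetical width / regularization

def design(x):                           # Gaussian kernel design matrix
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

Phi_tr, Phi_te = design(x_tr), design(x_te)
H = Phi_tr.T @ Phi_tr / len(x_tr)        # empirical second moment under p_train
h = Phi_te.mean(axis=0)                  # empirical mean under p_test
alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)   # analytic solution
w_hat = np.maximum(Phi_tr @ alpha, 0.0)  # estimated importance at training points
```

No density is ever estimated; the ratio is fitted directly, and the single linear solve is what makes uLSIF-type methods computationally attractive.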

For accurately estimating the density ratio in high-dimensional spaces, dimensionality reduction is critical. We named the subspace in which the numerator and denominator densities of the density ratio differ the hetero-distributional subspace, and proposed a new framework for estimating the density ratio combined with hetero-distributional subspace search, called Direct Density-ratio estimation with Dimensionality Reduction (D3; D-cube). We developed a computationally efficient D3 algorithm that combines Local Fisher Discriminant Analysis (LFDA) (Journal of Machine Learning Research, 2007) and unconstrained Least-Squares Importance Fitting (uLSIF) (Journal of Machine Learning Research, 2009). Furthermore, we proposed an algorithm called the Least-squares Hetero-distributional Subspace Search (LHSS), which identifies the subspace in which the discrepancy between the numerator and denominator densities is maximized (SDM2010).

Density ratio estimation has various applications such as non-stationarity adaptation, outlier detection, mutual information estimation, and conditional probability estimation. See the review papers for details (IPSJ Transactions, 2009), (RIMS Kokyuroku, 2010), (Proceedings of the Institute of Statistical Mathematics, 2010, in Japanese).

Mutual information estimators can be used for various purposes such as dimensionality reduction, independent component analysis, and causal inference.

We consider an outlier detection problem where we want to identify outliers in a test data set based on a training data set that does not contain outliers. For this problem, we proposed an outlier detection method that uses the importance as a novelty score. Since the importance values can be obtained without going through density estimation, the proposed method works very well even in high-dimensional problems (Knowledge and Information Systems, 2011), (software1), (software2).

We proved that the boosting algorithm controls the error distribution properly, which gives a new theoretical explanation of its success (COLT2008). Furthermore, we proved that the nu-support vector machine optimizes an error measure called the conditional value-at-risk (CVaR) and theoretically elucidated its generalization performance (New Generation Computing, 2009).
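The conditional value-at-risk appearing here can be illustrated numerically; the loss distribution and level beta below are hypothetical:

```python
import numpy as np

# CVaR at level beta: the mean of the worst (1 - beta) fraction of losses.
rng = np.random.default_rng(0)
losses = rng.standard_normal(10000)       # toy loss samples

beta = 0.9
var = np.quantile(losses, beta)           # value-at-risk (the beta-quantile)
cvar = losses[losses >= var].mean()       # mean loss in the upper tail
```

Unlike the plain value-at-risk, CVaR accounts for how bad the tail losses are, which is why it is a natural risk-sensitive learning objective.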

Risk control is also important in the framework of reinforcement learning for sequential decision making. We developed reinforcement learning algorithms for optimizing various risk measures (IEICE Transactions on Information and Systems, 2010). Furthermore, we proposed a new framework for estimating the sum of rewards obtained in the future and developed various algorithms for risk management (UAI2010), (ICML2010).

Since approximating the target matrix under a low-rank constraint results in a non-convex optimization problem, obtaining a good solution is difficult in practice. To avoid this problem, the trace-norm regularizer is useful. The trace-norm regularizer is a convex function and works as the l1-regularizer on singular values. It therefore tends to produce sparse singular values, and hence a low-rank matrix, via convex optimization. However, optimization with trace-norm regularizers is still challenging for large-scale problems due to its non-differentiability. To cope with this problem, we proposed an efficient optimization algorithm for trace-norm regularization based on the dual augmented Lagrangian (ICML2010).
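The sparsifying mechanism can be sketched via the proximal operator of the trace norm, which soft-thresholds the singular values; the matrix and threshold below are hypothetical:

```python
import numpy as np

# Soft-thresholding the singular values (the proximal map of the trace norm)
# zeroes the small ones and hence returns a low-rank matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 10))  # rank-3 signal
A += 0.01 * rng.standard_normal((8, 10))                        # small noise

def prox_trace_norm(M, tau):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

B = prox_trace_norm(A, tau=0.5)   # noise-level singular values vanish
```

This proximal step is the core subproblem that proximal-gradient and augmented-Lagrangian solvers for trace-norm regularization repeatedly apply.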

Another popular formulation of low-rank approximation is matrix factorization: a matrix is decomposed into the product of two `thin' matrices. The matrix factorization model is non-identifiable, meaning that the decomposition is redundant. We showed that, in the variational Bayesian framework, the non-identifiability of the matrix factorization model induces a regularization effect even when the prior distributions are flat, which is a notably different behavior from identifiable models (ICML2010). The solution of variational Bayesian matrix factorization is usually computed by an iterative algorithm. Although the optimization problem is non-convex, we showed that the global optimal solution can be computed analytically (NIPS2010).

In image restoration, it is sometimes not sufficient to just obtain subjectively clear images; for example, when a face recorded by a security camera is restored, we should not add any subjective artifact to the image. In such scenarios, it is effective to model the degradation process and restore the image following the modeled process. Based on this idea, we restored the image of a historically important camera and contributed to elucidating Japanese camera history (Technical Report, 2004).

We modeled the non-stationarity of the brain as covariate shift and applied IWCV. This substantially improved the BCI recognition accuracy (Journal of Machine Learning Research, 2007), (IEEE Transactions on Biomedical Engineering, 2010).

Most existing surface profiling algorithms need to measure the interference level multiple times while changing the phase of the light. Surface profilers are often used in environments with vibration, such as factories, where the measurement accuracy degrades significantly if measurement is carried out multiple times. To cope with this problem, we developed an efficient and accurate method called local model fitting (LMF) that allows us to recover the surface profile from only a single measurement (Applied Optics, 2006). We further provided a theoretical error analysis of the LMF method and proposed an improved method, interpolated LMF (iLMF) (Applied Optics, 2009). We also developed iteratively-reweighted LMF (IRLMF), which can adaptively determine the local regions in LMF (Applied Optics, 2010).

Surface profiling algorithms based on monochromatic light sources, including the above LMF method, generally have a strong limitation on the maximum height of steps that can be measured, due to the phase periodicity. The maximum height can be extended by using multiple monochromatic light sources with different wavelengths; however, this requires multiple measurements, and the advantage of one-shot methods such as LMF is lost. We developed a method that achieves this extension with a single measurement (Journal of the Japan Society for Precision Engineering, 2009, in Japanese). For this work, we won the Odawara Award 2nd Prize at the Vision Engineering Workshop (ViEW2007) in 2007.

When the target object is covered by a film layer, existing surface profiling algorithms do not give accurate results because of refraction; furthermore, the thickness of the film cannot be measured. To cope with this problem, we developed algorithms that simultaneously measure the surface profile and the film thickness: with white light (Transactions of the Society of Instrument and Control Engineers, 2007, in Japanese), multiple-shot monochromatic light (Transactions of the Society of Instrument and Control Engineers, 2009, in Japanese), and one-shot monochromatic light (Journal of the Japan Society for Precision Engineering, 2009, in Japanese).

We further showed that updating robot control policies through online learning can be formulated as covariate shift. We applied the importance-weighted cross-validation (IWCV) method to value function approximation and proposed sample-reuse policy iteration (SRPI) (Neural Networks, 2009). We further applied a similar idea to a direct policy learning method and proposed reward-weighted regression with sample reuse (R^3) (ECML-PKDD2009).

Collecting data samples is often costly in real-world robotics, and therefore designing a good sampling policy is highly important. We applied our active learning method, ALICE, to sampling policy design in robot control and proposed active policy iteration (API) (Neural Networks, 2010).

In practical robot-control problems, reward samples are often measured by sensors and thus contain outliers. We applied a robust regression technique to policy iteration and proposed least absolute policy iteration (LAPI) (IEICE Transactions on Information and Systems, 2010), (demo).

Document classification involves extremely high-dimensional sparse data, so dimensionality reduction is essential. We applied the semi-supervised dimensionality reduction method SELF to document classification and showed that the classification accuracy is substantially improved (IEICE Transactions on Information and Systems, 2009), (Machine Learning, 2010), (software).

We developed an efficient thesaurus construction method based on unsupervised dimensionality reduction (IEICE Transactions on Information and Systems, 2010).

Face images possess high diversity due to varying angles and lighting conditions, so training and test images often follow different distributions. We developed a new age prediction method based on covariate shift adaptation techniques (ICPR2010).

Sugiyama Lab.,

Department of Computer Science,

Graduate School of Information Science and Engineering

Tokyo Institute of Technology

2-12-1-W8-74, O-okayama, Meguro-ku, Tokyo, 152-8552, Japan.

TEL & FAX: +81-3-5734-2699