Curse of dimensionality
The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. The expression was coined by Richard E. Bellman when considering problems in dynamic programming.
Dimensionally cursed phenomena occur in domains such as numerical analysis, sampling, combinatorics, machine learning, data mining and databases. The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality. Also, organizing and searching data often relies on detecting areas where objects form groups with similar properties; in high dimensional data, however, all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient.
In some problems, each variable can take one of several discrete values, or the range of possible values is divided to give a finite number of possibilities. Taking the variables together, a huge number of combinations of values must be considered. This effect is also known as the combinatorial explosion. Even in the simplest case of binary variables, the number of possible combinations already is , exponential in the dimensionality. Naively, each additional dimension doubles the effort needed to try all combinations.
There is an exponential increase in volume associated with adding extra dimensions to a mathematical space. For example, 102 = 100 evenly spaced sample points suffice to sample a unit interval (a "1-dimensional cube") with no more than 10−2 = 0.01 distance between points; an equivalent sampling of a 10-dimensional unit hypercube with a lattice that has a spacing of 10−2 = 0.01 between adjacent points would require 1020 = [(102)10] sample points. In general, with a spacing distance of 10−n the 10-dimensional hypercube appears to be a factor of 10n(10−1) = [(10n)10/(10n)] "larger" than the 1-dimensional hypercube, which is the unit interval. In the above example n = 2: when using a sampling distance of 0.01 the 10-dimensional hypercube appears to be 1018 "larger" than the unit interval. This effect is a combination of the combinatorics problems above and the distance function problems explained below.
When solving dynamic optimization problems by numerical backward induction, the objective function must be computed for each combination of values. This is a significant obstacle when the dimension of the "state variable" is large.
In machine learning problems that involve learning a "state-of-nature" from a finite number of data samples in a high-dimensional feature space with each feature having a range of possible values, typically an enormous amount of training data is required to ensure that there are several samples with each combination of values.
A typical rule of thumb is that there should be at least 5 training examples for each dimension in the representation. In machine learning and insofar as predictive performance is concerned, the curse of dimensionality is used interchangeably with the peaking phenomenon, which is also known as Hughes phenomenon. This phenomenon states that with a fixed number of training samples, the average (expected) predictive power of a classifier or regressor first increases as the number of dimensions or features used is increased but beyond a certain dimensionality it starts deteriorating instead of improving steadily.
Nevertheless, in the context of a simple classifier (linear discriminant analysis in the multivariate Gaussian model under the assumption of a common known covariance matrix) Zollanvari et al. showed both analytically and empirically that as long as the relative cumulative efficacy of an additional feature set (with respect to features that are already part of the classifier) is greater (or less) than the size of this additional feature set, the expected error of the classifier constructed using these additional features will be less (or greater) than the expected error of the classifier constructed without them. In other words, both the size of additional features and their (relative) cumulative discriminatory effect are important in observing a decrease or increase in the average predictive power.
When a measure such as a Euclidean distance is defined using many coordinates, there is little difference in the distances between different pairs of samples.
One way to illustrate the "vastness" of high-dimensional Euclidean space is to compare the proportion of an inscribed hypersphere with radius and dimension , to that of a hypercube with edges of length The volume of such a sphere is , where is the gamma function, while the volume of the cube is . As the dimension of the space increases, the hypersphere becomes an insignificant volume relative to that of the hypercube. This can clearly be seen by comparing the proportions as the dimension goes to infinity:
- as .
Furthermore, the distance between the center and the corners is , which increases without bound for fixed r. In this sense, nearly all of the high-dimensional space is "far away" from the centre. In high dimensions, the volume of the d-dimensional unit hypercube (with coordinates of the vertices ) is concentrated near a sphere with the radius for large dimension d. Indeed, for each coordinate the average value of in the cube is
The variance of for uniform distribution in the cube is
Therefore, the squared distance from the origin, has the average value d/3 and variance 4d/45. For large d, distribution of is close to the normal distribution with the mean 1/3 and the standard deviation according to the central limit theorem. Thus, in high dimensions, both the "middle" of the hypercube, and the corners are empty, and all the volume is concentrated near a sphere of the "intermediate" radius .
This also helps to understand the chi-squared distribution. Indeed, the (non-central) chi-squared distribution associated to a random point in the interval [-1, 1] is the same as the distribution of the length-squared of a random point in the d-cube. By the law of large numbers, this distribution concentrates itself in a narrow band around d times the standard deviation squared (σ2) of the original derivation. This illuminates the chi-squared distribution and also illustrates that most of the volume of the d-cube concentrates near the surface of a sphere of radius .
A further development of this phenomenon is as follows. Any fixed distribution on the real numbers induces a product distribution on points in ℝd. For any fixed n, it turns out that the minimum and the maximum distance between a random reference point Q and a list of n random data points P1,...,Pn become indiscernible compared to the minimum distance:
This is often cited as distance functions losing their usefulness (for the nearest-neighbor criterion in feature-comparison algorithms, for example) in high dimensions. However, recent research has shown this to only hold in the artificial scenario when the one-dimensional distributions ℝ are independent and identically distributed. When attributes are correlated, data can become easier and provide higher distance contrast and the signal-to-noise ratio was found to play an important role, thus feature selection should be used.
Nearest neighbor search
The effect complicates nearest neighbor search in high dimensional space. It is not possible to quickly reject candidates by using the difference in one coordinate as a lower bound for a distance based on all the dimensions.
However, it has recently been observed that the mere number of dimensions does not necessarily result in difficulties, since relevant additional dimensions can also increase the contrast. In addition, for the resulting ranking it remains useful to discern close and far neighbors. Irrelevant ("noise") dimensions, however, reduce the contrast in the manner described above. In time series analysis, where the data are inherently high-dimensional, distance functions also work reliably as long as the signal-to-noise ratio is high enough.
k-nearest neighbor classification
Another effect of high dimensionality on distance functions concerns k-nearest neighbor (k-NN) graphs constructed from a data set using a distance function. As the dimension increases, the indegree distribution of the k-NN digraph becomes skewed with a peak on the right because of the emergence of a disproportionate number of hubs, that is, data-points that appear in many more k-NN lists of other data-points than the average. This phenomenon can have a considerable impact on various techniques for classification (including the k-NN classifier), semi-supervised learning, and clustering, and it also affects information retrieval.
In a 2012 survey, Zimek et al. identified the following problems when searching for anomalies in high-dimensional data:
- Concentration of scores and distances: derived values such as distances become numerically similar
- Irrelevant attributes: in high dimensional data, a significant number of attributes may be irrelevant
- Definition of reference sets: for local methods, reference sets are often nearest-neighbor based
- Incomparable scores for different dimensionalities: different subspaces produce incomparable scores
- Interpretability of scores: the scores often no longer convey a semantic meaning
- Exponential search space: the search space can no longer be systematically scanned
- Data snooping bias: given the large search space, for every desired significance a hypothesis can be found
- Hubness: certain objects occur more frequently in neighbor lists than others.
Many of the analyzed specialized methods tackle one or another of these problems, but there remain many open research questions.
Blessing of dimensionality
Surprisingly and despite the expected "curse of dimensionality" difficulties, common-sense heuristics based on the most straightforward methods "can yield results which are almost surely optimal" for high-dimensional problems. The term "blessing of dimensionality" was introduced in the late 1990s. Donoho in his "Millennium manifesto" clearly explained why the "blessing of dimensionality" will form a basis of future data mining. The effects of the blessing of dimensionality were discovered in many applications and found their foundation in the concentration of measure phenomena. One example of the blessing of dimensionality phenomenon is linear separability of a random point from a large finite random set with high probability even if this set is exponentially large: the number of elements in this random set can grow exponentially with dimension. Moreover, this linear functional can be selected in the form of the simplest linear Fisher discriminant. This separability theorem was proven for a wide class of probability distributions: general uniformly log-concave distributions, product distributions in a cube and many other families (reviewed recently in ).
"The blessing of dimensionality and the curse of dimensionality are two sides of the same coin." For example, the typical property of essentially high-dimensional probability distributions in a high-dimensional space is: the squared distance of random points to a selected point is, with high probability, close to the average (or median) squared distance. This property significantly simplifies the expected geometry of data and indexing of high-dimensional data (blessing), but, at the same time, it makes the similarity search in high dimensions difficult and even useless (curse).
Zimek et al. noted that while the typical formalizations of the curse of dimensionality affect i.i.d. data, having data that is separated in each attribute becomes easier even in high dimensions, and argued that the signal-to-noise ratio matters: data becomes easier with each attribute that adds signal, and harder with attributes that only add noise (irrelevant error) to the data. In particular for unsupervised data analysis this effect is known as swamping.
- Bellman, Richard Ernest; Rand Corporation (1957). Dynamic programming. Princeton University Press. p. ix. ISBN 978-0-691-07951-6.,
Republished: Bellman, Richard Ernest (2003). Dynamic Programming. Courier Dover Publications. ISBN 978-0-486-42809-3.
- Bellman, Richard Ernest (1961). Adaptive control processes: a guided tour. Princeton University Press. ISBN 9780691079011.
- Taylor, C. Robert (1993). "Dynamic Programming and the Curses of Dimensionality". Applications Of Dynamic Programming To Agricultural Decision Problems. Westview Press. pp. 1–10. ISBN 0-8133-8641-1.
- Koutroumbas, Konstantinos; Theodoridis, Sergios (2008). Pattern Recognition (4th ed.). Burlington. ISBN 978-1-59749-272-0. Retrieved 8 January 2018.
- Hughes, G.F. (January 1968). "On the mean accuracy of statistical pattern recognizers". IEEE Transactions on Information Theory. 14 (1): 55–63. doi:10.1109/TIT.1968.1054102. S2CID 206729491.
- Trunk, G. V. (July 1979). "A Problem of Dimensionality: A Simple Example". IEEE Transactions on Pattern Analysis and Machine Intelligence. PAMI-1 (3): 306–307. doi:10.1109/TPAMI.1979.4766926. PMID 21868861.
- B. Chandrasekaran; A. K. Jain (1974). "Quantization Complexity and Independent Measurements". IEEE Transactions on Computers. 23 (8): 102–106. doi:10.1109/T-C.1974.223789.
- McLachlan, G. J. (2004). Discriminant Analysis and Statistical Pattern Recognition. Wiley Interscience. ISBN 978-0-471-69115-0. MR 1190469.
- A. Zollanvari; A. P. James; R. Sameni (2020). "A Theoretical Analysis of the Peaking Phenomenon in Classification". Journal of Classification. 37 (2): 421–434. doi:10.1007/s00357-019-09327-3.
- Beyer, K.; Goldstein, J.; Ramakrishnan, R.; Shaft, U. (1999). When is "Nearest Neighbor" Meaningful?. Proc. 7th International Conference on Database Theory - ICDT'99. LNCS. 1540. pp. 217–235. doi:10.1007/3-540-49257-7_15. ISBN 978-3-540-65452-0.
- Zimek, A.; Schubert, E.; Kriegel, H.-P. (2012). "A survey on unsupervised outlier detection in high-dimensional numerical data". Statistical Analysis and Data Mining. 5 (5): 363–387. doi:10.1002/sam.11161.
- Marimont, R.B.; Shapiro, M.B. (1979). "Nearest Neighbour Searches and the Curse of Dimensionality". IMA J Appl Math. 24 (1): 59–70. doi:10.1093/imamat/24.1.59.
- Chávez, Edgar; Navarro, Gonzalo; Baeza-Yates, Ricardo; Marroquín, José Luis (2001). "Searching in Metric Spaces". ACM Computing Surveys. 33 (3): 273–321. CiteSeerX 10.1.1.100.7845. doi:10.1145/502807.502808.
- Houle, M. E.; Kriegel, H. P.; Kröger, P.; Schubert, E.; Zimek, A. (2010). Can Shared-Neighbor Distances Defeat the Curse of Dimensionality? (PDF). Scientific and Statistical Database Management. Lecture Notes in Computer Science. 6187. p. 482. doi:10.1007/978-3-642-13818-8_34. ISBN 978-3-642-13817-1.
- Bernecker, T.; Houle, M. E.; Kriegel, H. P.; Kröger, P.; Renz, M.; Schubert, E.; Zimek, A. (2011). Quality of Similarity Rankings in Time Series. Symposium on Spatial and Temporal Databases. Lecture Notes in Computer Science. 6849. p. 422. doi:10.1007/978-3-642-22922-0_25. ISBN 978-3-642-22921-3.
- Radovanović, Miloš; Nanopoulos, Alexandros; Ivanović, Mirjana (2010). "Hubs in space: Popular nearest neighbors in high-dimensional data" (PDF). Journal of Machine Learning Research. 11: 2487–2531.
- Radovanović, M.; Nanopoulos, A.; Ivanović, M. (2010). On the existence of obstinate results in vector space models. 33rd international ACM SIGIR conference on Research and development in information retrieval - SIGIR '10. p. 186. doi:10.1145/1835449.1835482. ISBN 9781450301534.
- Kainen, Paul C. (1997), "Utilizing Geometric Anomalies of High Dimension: When Complexity Makes Computation Easier", in Kárný, M.; Warwick, K. (eds.), Computer Intensive Methods in Control and Signal Processing, pp. 283–294, doi:10.1007/978-1-4612-1996-5_18
- Donoho, David L. (2000), "High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality", Invited lecture at Mathematical Challenges of the 21st Century, AMS National Meeting, Los Angeles, CA, USA, August 6-12, 2000, CiteSeerX 10.1.1.329.3392
- Gorban, Alexander N.; Makarov, Valery A.; Tyukin, Ivan Y. (2020). "High-Dimensional Brain in a High-Dimensional World: Blessing of Dimensionality". Entropy. 22 (1): 82. arXiv:2001.04959. Bibcode:2020Entrp..22...82G. doi:10.3390/e22010082. PMC 7516518. PMID 33285855.
- Gorban, Alexander N.; Tyukin, Ivan Y. (2018). "Blessing of dimensionality: mathematical foundations of the statistical physics of data". Phil. Trans. R. Soc. A. 376 (2118): 20170237. arXiv:1801.03421. Bibcode:2018RSPTA.37670237G. doi:10.1098/rsta.2017.0237. PMC 5869543. PMID 29555807.
- Hecht-Nielsen, Robert (1994), "Context vectors: general-purpose approximate meaning representations self-organized from raw data", in Zurada, J.M.; Marks, R.J.; Robinson, C.J. (eds.), Computational intelligence: imitating life; Proceedings of World Congress on Computational Intelligence, Neural Networks; 1994; Orlando; FL, Piscataway, NJ: IEEE Press, pp. 43–56, ISBN 0780311043
- Pestov, Vladimir (2013). "Is the k-NN classifier in high dimensions affected by the curse of dimensionality?". Comput. Math. Appl. 65 (10): 43–56. doi:10.1016/j.camwa.2012.09.011.