Полиномиальные аппроксимационные схемы для задач выбора векторов и кластеризации с разными центрами

Пяткин А. В.

Дискретный анализ и исследование операций, 30, 3, 96-110 (2023)

УДК 519.8+518.25
DOI: 10.33048/daio.2023.30.763

Аннотация:

Рассматриваются две близкие в постановочном плане задачи. Во-первых, общая задача кластеризации, т. е. разбиения множества $d$-мерных евклидовых векторов на заданное число кластеров с разными типами центров, при котором суммарная дисперсия будет минимальной. Под дисперсией понимается сумма квадратов расстояний между элементами кластера и его центром. При этом для одной части кластеров центр может быть выбран произвольно (очевидно, что в этом случае следует выбрать центроид), для другой части в качестве центра должен быть выбран один из векторов исходного множества, а для остальных кластеров центрами являются заранее заданные точки. Также на входе задаются размеры каждого кластера. Вторая рассматриваемая задача — это задача выбора подмножества векторов заданной мощности с минимальной суммой квадратов расстояний от его элементов до центроида. В статье построены полиномиальные аппроксимационные схемы (PTAS) для обеих задач.
Библиогр. 23.

Литература:

Berkhin P. A survey of clustering data mining techniques // Grouping multidimensional data: Recent advances in clustering. Heidelberg: Springer, 2006. P. 25–71.
Jain A. K., Dubes R. C. Algorithms for clustering data. Englewood Cliffs, NJ: Prentice Hall, 1988.
Ghoreyshi S., Hosseinkhani J. Developing a clustering model based on $K$-means algorithm in order to creating different policies for policyholders // Int. J. Adv. Comput. Sci. Inf. Tech. 2015. V. 4, No. 2. P. 46–53.
Fisher W. D. On grouping for maximum homogeneity // J. Am. Stat. Assoc. 1958. V. 53, No. 284. P. 789–798.
MacQueen J. Some methods for classification and analysis of multivariate observations // Proc. 5th Berkeley Symp. Mathematics, Statistics and Probability (Berkeley, USA, June 21 — July 18, 1965; Dec. 27, 1965 —Jan. 7, 1966). V. 1. Berkeley: Univ. California Press, 1967. P. 281–297.
Kaufman L., Rousseeuw P. J. Clustering by means of medoids // Statistical data analysis based on the $L_1$-norm. Amsterdam: North-Holland, 1987. P. 405–416.
Aloise D., Deshpande A., Hansen P., Popat P. NP-hardness of Euclidean sum-of-squares clustering // Mach. Learn. 2009. V. 75, No. 2. P. 245–248.
Inaba M., Katoh N., Imai H. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering // Proc. 10th Annu. Symp. Computational Geometry (Stony Brook, NY, USA, June 6–8, 1994). New York: ACM Press, 1994. P. 332–339.
Kumar A., Sabharwal Y., Sen S. A simple linear time $(1+\epsilon)$-approximation algorithm for geometric $k$-means clustering in any dimensions // Proc. 45th Annu. IEEE Symp. Foundations of Computer Science (Rome, Italy, Oct. 17–19, 2004). Los Alamitos, CA: IEEE Comp. Soc., 2004. P. 454–462.
Бабурин А. Е., Гимади Э. Х., Глебов Н. И., Пяткин А. В. Задача отыскания подмножества векторов с максимальным суммарным весом // Дискрет. анализ и исслед. операций. Сер. 2. 2007. Т. 14, № 1. С. 32–42.
Гимади Э. Х., Кельманов А. В., Кельманова М. А., Хамидуллин С. А. Апостериорное обнаружение в числовой последовательности квазипериодического фрагмента при заданном числе повторов // Сиб. журн. индустр. математики. 2006. Т. 9, № 1. С. 55–74.
Кельманов А. В., Пяткин А. В. О сложности одного из вариантов задачи выбора подмножества «похожих» векторов // Докл. Акад. наук. 2008. Т. 421, № 5. С. 590–592.
Кельманов А. В., Пяткин А. В. Об одном варианте задачи выбора подмножества векторов // Дискрет. анализ и исслед. операций. 2008. Т. 15, № 5. С. 20–34.
Долгушев А. В., Кельманов А. В. Приближённый алгоритм решения одной задачи кластерного анализа // Дискрет. анализ и исслед. операций. 2011. Т. 18, № 2. С. 29–40.
Кельманов А. В., Хандеев В. И. Полиномиальный алгоритм с оценкой точности 2 для решения одной задачи кластерного анализа // Дискрет. анализ и исслед. операций. 2013. Т. 20, № 4. С. 36–45.
Долгушев А. В., Кельманов А. В., Шенмайер В. В. Полиномиальная аппроксимационная схема для одной задачи разбиения конечного множества на два кластера // Тр. Ин-та математики и механики. 2015. Т. 21, № 3. С. 100–109.
Кельманов А. В., Пяткин А. В., Хандеев В. И. NP-трудность квадратичной евклидовой задачи 2-кластеризации 1-mean и 1-median с ограничениями на размеры кластеров // Докл. Акад. наук. 2019. Т. 489, No. 4. С. 339–343.
Pyatkin A. V. 1-Mean and 1-medoid 2-clustering problem with arbitrary cluster sizes: Complexity and approximation // Yugoslav J. Oper. Res. 2023. V. 33, No. 1. P. 59–69.
Шенмайер В. В. Аппроксимационная схема для одной задачи поиска подмножества векторов // Дискрет. анализ и исслед. операций. 2012. Т. 19, № 2. С. 93–101.
Галашов А. Е., Кельманов А. В. 2-Приближённый алгоритм для одной задачи поиска семейства непересекающихся подмножеств векторов // Автоматика и телемеханика. 2014. № 4. С. 5–19.
Edmonds J., Karp R. M. Theoretical improvements in algorithmic efficiency for network flow problems // J. ACM. 1972. V. 19, No. 2. P. 248–264.
Gabow H. N., Tarjan R. E. Faster scaling algorithms for network problems // SIAM J. Comput. 1989. V. 18, No. 5. P. 1013–1036.
Wirth H. Algorithms + data structures = programs. Englewood Cliffs, NJ: Prentice Hall, 1976.

Исследование выполнено в рамках государственного задания ИМ СО РАН (проект № FWNF–2022–0019).

Пяткин Артём Валерьевич

Институт математики им. С. Л. Соболева,
пр. Акад. Коптюга, 4, 630090 Новосибирск, Россия

E-mail: artempyatkin@gmail.com

Статья поступила 7 февраля 2023 г.
После доработки — 24 апреля 2023 г.
Принята к публикации 25 апреля 2023 г.

Abstract:

Two close in statements problems are considered. The first one is clustering, i. e. partitioning the set of $d$-dimensional Euclidean vectors into the given number of clusters with different types of centers so that the total dispersion would be minimum. By dispersion here we mean the sum of squared distances between the elements of the clusters and their centers. There are three types of centers: an arbitrary point (clearly, the centroid is the best choice), a point of the initial set (so-called medoid) or a fixed point of the space given in advance. The sizes of the clusters are also given as a part of the input. The second problem is the vector subset choice problem, which is finding a subset of vectors of fixed cardinality having the minimum sum of squared distances between its elements and the centroid. For each of these problems a PTAS is constructed.
Bibliogr. 23.

References:

P. Berkhin, A survey of clustering data mining techniques, in Grouping Multidimensional Data: Recent Advances in Clustering (Springer, Heidelberg, 2006), pp. 25–71.
A. K. Jain and R. C. Dubes, Algorithms for Clustering Data (Prentice Hall, Englewood Cliffs, NJ, 1988).
S. Ghoreyshi and J. Hosseinkhani, Developing a clustering model based on $K$-means algorithm in order to creating different policies for policyholders, Int. J. Adv. Comput. Sci. Inf. Tech. 4 (2), 46–53 (2015).
W. D. Fisher, On grouping for maximum homogeneity, J. Am. Stat. Assoc. 53 (284), 789–798 (1958).
J. MacQueen, Some methods for classification and analysis of multivariate observations, in Proc. 5th Berkeley Symp. Mathematics, Statistics and Probability, Berkeley, USA, June 21 — July 18, 1965; Dec. 27, 1965 — Jan. 7, 1966, Vol. 1 (Univ. California Press, Berkeley, 1967), pp. 281–297.
L. Kaufman and P. J. Rousseeuw, Clustering by means of medoids, in Statistical Data Analysis Based on the $L_1$-Norm (North-Holland, Amsterdam, 1987), pp. 405–416.
D. Aloise, A. Deshpande, P. Hansen, and P. Popat, NP-hardness of Euclidean sum-of-squares clustering, Mach. Learn. 75 (2), 245–248 (2009).
M. Inaba, N. Katoh, and H. Imai, Applications of weighted Voronoi diagrams and randomization to variance-based $k$-clustering, in Proc. 10th Annu. Symp. Computational Geometry, Stony Brook, NY, USA, June 6–8, 1994 (ACM Press, New York, 1994), pp. 332–339.
A. Kumar, Y. Sabharwal, and S. Sen, A simple linear time $(1 + \epsilon)$-approximation algorithm for geometric $k$-means clustering in any dimensions, in Proc. 45th Annu. IEEE Symp. Foundations of Computer Science, Rome, Italy, Oct. 17–19, 2004 (IEEE Comp. Soc., Los Alamitos, CA, 2004), pp. 454–462.
A. E. Baburin, É. Kh. Gimadi, N. I. Glebov, and A. V. Pyatkin, The problem of finding a subset of vectors with the maximum total weight, Diskretn. Anal. Issled. Oper., Ser. 2, 14 (1), 32–42 (2007) [Russian] [J. Appl. Ind. Math. 2 (1), 32–38 (2008)].
É. Kh. Gimadi, A. V. Kel’manov, M. A. Kel’manova, and S. A. Khamidullin, A posteriori detection of a quasiperiodic fragment with a given number of repetitions in a numerical sequence, Sib. Zh. Ind. Mat. 9 (1), 55–74 (2006) [Russian].
A. V. Kel’manov and A. V. Pyatkin, On the complexity of a search for a subset of «similar» vectors, Dokl. Akad. Nauk 421 (5), 590–592 (2008) [Russian] [Dokl. Math. 78 (1), 574–575 (2008)].
A. V. Kel’manov and A. V. Pyatkin, On a version of the problem of choosing a vector subset, Diskretn. Anal. Issled. Oper. 15 (5), 20–34 (2008) [Russian] [J. Appl. Ind. Math. 3 (4), 447–455 (2009)].
A. V. Dolgushev and A. V. Kel’manov, An approximation algorithm for solving a problem of cluster analysis, Diskretn. Anal. Issled. Oper. 18 (2), 29–40 (2011) [Russian] [J. Appl. Ind. Math. 5 (4), 551–558 (2011)].
A. V. Kel’manov and V. I. Khandeev, A 2-approximation polynomial algorithm for a clustering problem, Diskretn. Anal. Issled. Oper. 20 (4), 36–45 (2013) [Russian] [J. Appl. Ind. Math. 7 (4), 515–521 (2013)].
A. V. Dolgushev, A. V. Kel’manov, and V. V. Shenmaier, A polynomial-time approximation scheme for a problem of partitioning a finite set into two clusters, Tr. Inst. Mat. Mekh. 21 (3), 100–109 (2015) [Russian] [Proc. Steklov Inst. Math. 295 (Suppl. 1), 47–56 (2016)].
A. V. Kel’manov, A. V. Pyatkin, and V. I. Khandeev, NP-hardness of quadratic Euclidean 1-mean and 1-median 2-clustering problem with constraints on the cluster sizes, Dokl. Akad. Nauk 489 (4), 339–343 (2019) [Russian] [Dokl. Math. 100 (3), 545–548 (2019)].
A. V. Pyatkin, 1-Mean and 1-medoid 2-clustering problem with arbitrary cluster sizes: Complexity and approximation, Yugoslav J. Oper. Res. 33 (1), 59–69 (2023).
V. V. Shenmaier, An approximation scheme for a problem of search for a vector subset, Diskretn. Anal. Issled. Oper. 19 (2), 93–101 (2012) [Russian] [J. Appl. Ind. Math. 6 (3), 381–386 (2012)].
A. E. Galashov and A. V. Kel’manov, A 2-approximate algorithm to solve one problem of a family of disjoint vector subsets, Avtom. Telemekh., No. 4, 5–19 (2014) [Russian] [Autom. Remote Control 75 (4), 595–606 (2014)].
J. Edmonds and R. M. Karp, Theoretical improvements in algorithmic efficiency for network flow problems, J. ACM 19 (2), 248–264 (1972).
H. N. Gabow and R. E. Tarjan, Faster scaling algorithms for network problems, SIAM J. Comput. 18 (5), 1013–1036 (1989).
H. Wirth, Algorithms + Data Structures = Programs (Prentice Hall, Englewood Cliffs, NJ, 1976).

электронное сетевое издание
«Дискретный анализ и исследование операций»

Полиномиальные аппроксимационные схемы для задач выбора векторов и кластеризации с разными центрами