Seleção de variáveis para clusterização de bateladas produtivas através de ACP e remapeamento kernel
Clustering variable selection for grouping production batches through PCA and kernel mapping
Cervo, Victor Leonardo; Anzanello, Michel José
http://dx.doi.org/10.1590/0103-6513.143613
Production, vol.25, n4, p.823-833, 2015
Abstract
Clustering techniques seek groups whose observations are homogeneous within a group and significantly distinct from the observations in other groups. In industrial processes that rely on batch production, defining families (groups) of batches with similar profiles supports the definition of control and monitoring strategies for those processes. This paper proposes a method for selecting the most relevant clustering variables for forming batch families. To that end, it integrates kernel functions with a new variable importance index derived from the parameters of Principal Component Analysis (PCA). The quality of the resulting clusters is assessed through the Silhouette Index (SI). When applied to three production processes, the proposed approach retained on average 5.16% of the original variables and raised the average SI by 235.4% relative to using all variables. A simulation study is also carried out to assess the robustness of the method.
Keywords
Clustering analysis. Variable selection. Kernel. Batch processes.
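The abstract does not give the formula of the proposed importance index, so the following is only a minimal sketch of the general idea, not the authors' method. It ranks variables by a simple stand-in index (absolute PCA loadings weighted by explained variance), clusters with a plain k-means, and compares the Silhouette Index obtained with all variables against the one obtained with the top-ranked variables. The kernel mapping step is omitted. The function names, the synthetic data, and the choice of k-means are all assumptions made for illustration.

```python
import numpy as np

def silhouette_index(X, labels):
    """Mean silhouette width (Rousseeuw, 1987) with Euclidean distances."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # a singleton cluster gets silhouette 0 by convention
        a = dist[i, own].sum() / (own.sum() - 1)  # dist[i, i] = 0, so self is excluded
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

def pca_importance(X, n_components=1):
    """Hypothetical index: |loadings| weighted by explained variance share."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)  # assumes no constant columns
    _, s, vt = np.linalg.svd(Z, full_matrices=False)
    explained = s ** 2 / (s ** 2).sum()
    return (np.abs(vt[:n_components]) * explained[:n_components, None]).sum(axis=0)

def kmeans(X, k, n_iter=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

# Synthetic "batches": 2 informative variables separate two families, 6 are noise.
rng = np.random.default_rng(0)
n = 100
group = np.repeat([0, 1], n)
informative = rng.normal(size=(2 * n, 2)) + np.where(group[:, None] == 0, -3.0, 3.0)
X = np.hstack([informative, rng.normal(size=(2 * n, 6))])

importance = pca_importance(X)
top = np.argsort(importance)[::-1][:2]          # keep the two highest-ranked variables
si_all = silhouette_index(X, kmeans(X, 2))
si_top = silhouette_index(X[:, top], kmeans(X[:, top], 2))
gain = 100 * (si_top - si_all) / si_all         # relative SI increase, in percent
print(f"retained variables: {sorted(top.tolist())}")
print(f"SI all = {si_all:.3f}, SI selected = {si_top:.3f}, gain = {gain:.1f}%")
```

The last line computes a relative SI increase in the same sense as the percentage reported in the abstract: the SI with the retained variables is compared against the SI with all variables. The numbers this toy example produces say nothing about the paper's results.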