TOPIC GROUPING BASED ON DESCRIPTION TEXT IN MICROSOFT RESEARCH VIDEO DESCRIPTION CORPUS DATA USING FASTTEXT, PCA AND K-MEANS CLUSTERING

Authors

  • Ahmad Hafidh Ayatullah
  • Nanik Suciati

DOI:

https://doi.org/10.33795/jip.v9i2.1271

Keywords:

Microsoft Research Video Description Corpus, FastText, PCA, K-means, silhouette coefficient

Abstract

Video retrieval can be based on the voice, image, or text data that represents the video content. Text-based video search works by computing the similarity between the text description supplied by the user and the text descriptions of all videos in the database; only videos that reach a certain level of similarity are returned to the user. The similarity of description texts can be determined from the clustering results of their feature representations, which are obtained with a word embedding. This research groups the topics of the Microsoft Research Video Description Corpus (MRVDC), a video dataset developed by Microsoft Research that contains paraphrased event descriptions in English and other languages, based on the text descriptions in its Indonesian-language subset. The resulting topic groups show the patterns of similarity and the interrelationships between the text descriptions of different videos, which is useful for topic-based video retrieval. The topic grouping process uses fastText as the word embedding, PCA as the feature reduction method, and K-means as the clustering method. Experiments on 1959 videos with 43753 text descriptions, varying the number of clusters k and running with and without PCA, show that the optimal number of clusters is 180, with a silhouette coefficient of 0.123115. The optimal clustering results of this study can be used in video retrieval systems for the Indonesian-language MRVDC dataset.
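As an illustration of how a number of clusters can be selected with the silhouette coefficient, the following is a minimal sketch and not the authors' implementation. It assumes that the description vectors are already available: synthetic 2-D points stand in for the fastText embeddings after PCA, and the function names `kmeans` and `silhouette`, the farthest-first initialization, and the range of k tried are all assumptions made for this example. The silhouette value of a point is (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the points of the nearest other cluster.

```python
import math
import random


def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm; returns one cluster label per point.

    Initialization is deterministic farthest-first traversal (an assumption
    of this sketch, chosen so that the example is reproducible).
    """
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two non-empty clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy stand-in for reduced description vectors: three well-separated groups.
rng = random.Random(42)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [
    (cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
    for cx, cy in centers
    for _ in range(20)
]

# Try several values of k and keep the one with the highest mean silhouette.
scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On real description embeddings the clusters overlap far more than in this toy data, so silhouette values are typically much lower, and in practice a library implementation of K-means and of the silhouette score would likely be preferred over this pure-Python version.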

Published

2023-02-28

How to Cite

Ayatullah, A. H., & Suciati, N. (2023). TOPIC GROUPING BASED ON DESCRIPTION TEXT IN MICROSOFT RESEARCH VIDEO DESCRIPTION CORPUS DATA USING FASTTEXT, PCA AND K-MEANS CLUSTERING. Jurnal Informatika Polinema, 9(2), 223–228. https://doi.org/10.33795/jip.v9i2.1271