Explore chapters and articles related to this topic
A review of DNA sequence data analysis technologies and their combination with data mining methods
Published in Lin Liu, Automotive, Mechanical and Electrical Engineering, 2017
Tiange Yu, Yang Chen, Bowen Zhang
DNA sequence patterns carry important biological meaning in the survival and evolution of species, and identifying such patterns is a crucial task in DNA sequence data analysis. Researchers have developed sequence pattern mining algorithms that are highly efficient and suitable for analyzing large data sets. Sequence pattern mining was first proposed by Agrawal and Srikant in 1995 (R. Agrawal and R. Srikant, 1995). Then, in 1996, Srikant et al. described the Generalized Sequential Patterns (GSP) mining algorithm (R. Srikant and R. Agrawal, 1996). The GSP algorithm introduced time and conceptual constraints and searches for patterns within the data set using a bottom-up Breadth-First-Search (BFS) method. However, it has the disadvantage of generating a large set of candidate patterns, which reduces its overall efficiency. To address this issue, Pei et al. proposed the PrefixSpan algorithm, which is based on pattern growth (J. Han, 2001). The PrefixSpan method reduces the computational complexity by dividing the overall data set into smaller subsets and performing data mining on these individual subsets.
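The divide-and-mine idea behind PrefixSpan can be illustrated with a minimal sketch. It assumes that each sequence is a plain string of single symbols (for example, nucleotides) and that support is the number of sequences containing a pattern as a subsequence. The function name `prefixspan` and the toy data are hypothetical choices for illustration; this is not the implementation from the cited papers and omits the time and conceptual constraints mentioned above.

```python
from collections import Counter


def prefixspan(sequences, min_support):
    """Return {pattern: support} for frequent subsequences.

    Simplified sketch: every element of a sequence is a single symbol,
    and gaps between matched symbols are allowed.
    """
    results = {}

    def mine(prefix, projected):
        # Count each symbol at most once per projected sequence.
        counts = Counter()
        for seq in projected:
            counts.update(set(seq))

        for item, support in sorted(counts.items()):
            if support < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = support

            # Build the smaller projected subset: for each sequence that
            # contains `item`, keep only what follows its first occurrence.
            new_projected = []
            for seq in projected:
                if item in seq:
                    suffix = seq[seq.index(item) + 1:]
                    if suffix:
                        new_projected.append(suffix)

            # Recurse on the subset instead of rescanning the full data set.
            mine(pattern, new_projected)

    mine((), [tuple(s) for s in sequences])
    return results


# Hypothetical toy data, not taken from the reviewed study.
toy_sequences = ["ACGT", "ACT", "AGT"]
frequent = prefixspan(toy_sequences, min_support=3)
for pattern, support in sorted(frequent.items()):
    print("".join(pattern), support)
```

Each recursive call works only on the suffixes that follow the current prefix, which is how the search space shrinks as patterns grow.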
Predicting Mood from Digital Footprints Using Frequent Sequential Context Patterns Features
Published in International Journal of Human–Computer Interaction, 2023
Muhammad Johan Alibasa, Rafael A. Calvo, Kalina Yacef
Alibasa and Calvo (2019) built a mood prediction model from digital activities using duration and frequency data. The study showed that the model achieved accuracies of up to 82% when the digital activity records were complete (at least 55 minutes of data available within the 60-minute window prior to mood sampling). However, the model could only achieve up to 68.9% accuracy when the digital activity records were not complete, e.g., when at least 30 minutes of data were available within the 60-minute window (Alibasa et al., 2019). The machine learning features were derived from the duration information of popular digital categories (e.g., email, writing, games) and the approximated task-switching occurrences. This mood prediction system was then considerably improved (accuracy up to 80%) by considering frequent sequential patterns using a GSP algorithm and a customized pre-processing step (Alibasa et al., 2019). The pre-processing method grouped all digital activity data into 5-minute activity data buckets, as shown in Figure 1. This study found that some sequential patterns occurred more frequently prior to positive mood reports and others prior to non-positive ones, and reported a higher performance in predicting mood when using these sequential patterns as features, compared to models with duration and frequency information only. In this study, any activity occurring in an activity data bucket was treated equally, regardless of its duration within that bucket. For example, in Figure 1, the first 5-minute bucket consists of 250 seconds of Writing, 40 seconds of Email and 10 seconds of Utilities, and the second bucket consists of 270 seconds of Email and 30 seconds of Writing. The GSP algorithm generates all possible sequential patterns while ignoring the duration information. This is because the algorithm treats the series of activity data buckets in the activity records as a partially ordered list.
The generated sequences can therefore include digital activities with short usage durations, for instance (Utilities, Writing) and (Email, Writing). Yet the Writing category is more dominant in the first bucket than the other digital activities, and the patterns generated by the GSP algorithm do not represent the duration proportions inside that activity data bucket.
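The loss of duration information can be made concrete with a small sketch built on the two buckets described for Figure 1. It assumes each bucket is reduced to an unordered set of activity categories before pattern generation; the helper names `to_itemsets`, `cross_bucket_pairs` and `duration_shares` are hypothetical, and the code only enumerates candidate two-step patterns rather than running a full GSP implementation.

```python
from itertools import product

# Seconds per category in the two 5-minute buckets described for Figure 1.
buckets = [
    {"Writing": 250, "Email": 40, "Utilities": 10},
    {"Email": 270, "Writing": 30},
]


def to_itemsets(buckets):
    """Drop durations: each bucket becomes an unordered set of categories."""
    return [frozenset(bucket) for bucket in buckets]


def cross_bucket_pairs(itemsets):
    """Enumerate (earlier activity, later activity) pairs across buckets."""
    pairs = set()
    for i in range(len(itemsets)):
        for j in range(i + 1, len(itemsets)):
            pairs.update(product(itemsets[i], itemsets[j]))
    return pairs


def duration_shares(bucket):
    """Fraction of the bucket's time spent in each category."""
    total = sum(bucket.values())
    return {category: seconds / total for category, seconds in bucket.items()}


pairs = cross_bucket_pairs(to_itemsets(buckets))
print(sorted(pairs))

shares = duration_shares(buckets[0])
for category, share in shares.items():
    print(f"{category}: {share:.1%}")
```

In this sketch, (Utilities, Writing) is produced alongside (Writing, Email) even though Utilities accounts for only 10 of the 300 seconds in the first bucket, which is the imbalance the passage describes.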