Feature Engineering for Text Data
Published in Guozhu Dong, Huan Liu, Feature Engineering for Machine Learning and Data Analytics, 2018
Chase Geigle, Qiaozhu Mei, ChengXiang Zhai
Another possibility is to use a feature set consisting of all possible subtrees encountered in some training data set of parsed sentences. This can be seen as generalizing the rewrite rule to a depth greater than one. Unfortunately, this will produce an exponential number of features, so clever techniques for computing the dot product of the feature vectors that would be induced for two parse trees (also called a tree kernel) have been designed [9,55]. Considering only the structure of the subtrees is a way of reducing the feature space [46]. Essentially, the idea is to consider subtree features where the syntactic category and surface words are deleted. Thus, two trees that are structurally the same but have different internal nodes would be collapsed down into the same feature. One can also consider keeping only certain syntactic categories; a semi-skeleton feature keeps only the root syntactic category and eliminates all of the others.
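The dot product mentioned above can be computed without enumerating the subtree features. The following is a minimal sketch of one well-known recursive formulation (in the style of the Collins and Duffy subset-tree kernel), not necessarily the exact method of the cited works. The nested-tuple tree encoding, the function names, and the toy sentences are assumptions made for illustration.

```python
from functools import lru_cache

# Assumed encoding: a tree is a nested tuple (label, child, child, ...);
# a leaf (surface word) is a plain string.

def nodes(tree):
    """Yield every internal node of a tree."""
    if isinstance(tree, tuple):
        yield tree
        for child in tree[1:]:
            yield from nodes(child)

def production(node):
    """Rewrite rule at a node: its label followed by its children's labels."""
    return (node[0],) + tuple(
        c[0] if isinstance(c, tuple) else c for c in node[1:]
    )

@lru_cache(maxsize=None)
def common_fragments(n1, n2):
    """Count tree fragments that are rooted at both n1 and n2."""
    if production(n1) != production(n2):
        return 0
    count = 1
    for c1, c2 in zip(n1[1:], n2[1:]):
        if isinstance(c1, tuple) and isinstance(c2, tuple):
            # Each child is either left unexpanded (the 1) or expanded
            # into any fragment the two children have in common.
            count *= 1 + common_fragments(c1, c2)
    return count

def tree_kernel(t1, t2):
    """Dot product of the implicit fragment-count vectors of t1 and t2."""
    return sum(common_fragments(a, b) for a in nodes(t1) for b in nodes(t2))

# Hypothetical toy parses that differ in one surface word.
t1 = ("S", ("NP", ("N", "dogs")), ("VP", ("V", "bark")))
t2 = ("S", ("NP", ("N", "cats")), ("VP", ("V", "bark")))

print(tree_kernel(t1, t2))  # 10 shared fragments
print(tree_kernel(t1, t1))  # 15 fragments in t1 itself
```

The memoized recursion visits each pair of nodes once, so the cost grows with the product of the two tree sizes rather than with the exponential number of fragments.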
Malware Threat Analysis of IoT Devices Using Deep Learning Neural Network Methodologies
Published in Sudhir Kumar Sharma, Bharat Bhushan, Bhuvan Unhelkar, Security and Trust Issues in Internet of Things, 2020
Moksh Grover, Nikhil Sharma, Bharat Bhushan, Ila Kaushik, Aditya Khamparia
Yasaswi et al. [21] proposed extracting similarity features from the intermediate code generated by a compiler infrastructure. Unsupervised learning was used to measure the similarity between source codes, and plagiarism was detected from the similar functionality revealed by contrasting them. In Ref. [22], a software benchmark was used to compare Java source codes and compute their similarities for threat detection; by running the source code, the approach captures its structural characteristics through selected control flow information. In Ref. [23], plagiarism in students' assignments was identified by combining latent semantic analysis with PlaGate, a combination used to audit the linguistic parallels between documents. A syntax tree was derived from the parse tree of any given source code, and different source codes could then be compared on the basis of their syntax trees. Cosma et al. [24] developed the Source Forager search engine, which, in response to user queries, fetched various properties from a code example, such as functionalities in C and C++ code, and returned “k” functionalities from the corpus. The software could detect resemblance between programs, with the conceptual structure of the program as its underlying logic. Kashyap et al. [25] used the parse tree kernel method to extract similar text from Java source codes. Because of irregular variations of nodes in core functionalities, this technique did not produce better results, so a fingerprinting method was designed to extract the resemblance between source codes. In Ref. [26], a logic-based approach was employed to compute the behavioral dissimilarity between source codes. Symbolic execution and precondition reasoning were used to obtain the semantics of dissimilarities from execution paths; if no dissimilarities were found, the case was treated as plagiarism. A detailed summary of the types of analysis for malware detection is explored in the subsections below.
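As a rough illustration of comparing source codes through their syntax trees, the sketch below parses two Python snippets with the standard `ast` module and compares the counts of syntax-tree node types by cosine similarity. It is an assumption-laden toy, not the method of any work cited above (which target Java, C, and C++); the function names and sample snippets are hypothetical.

```python
import ast
from collections import Counter
from math import sqrt

def node_type_counts(source):
    """Count the syntax-tree node types that occur in a piece of source code."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def syntax_similarity(src1, src2):
    """Structural similarity of two sources; identifier names are ignored."""
    return cosine_similarity(node_type_counts(src1), node_type_counts(src2))

# Hypothetical samples: the first two differ only in identifier names.
original = "def add(a, b):\n    return a + b\n"
renamed = "def total(x, y):\n    return x + y\n"
unrelated = "for i in range(3):\n    print(i)\n"

print(syntax_similarity(original, renamed))    # close to 1.0
print(syntax_similarity(original, unrelated))  # noticeably lower
```

Because only node types are counted, renaming variables does not change the score. A count of node types ignores how the nodes are arranged, so this is a much coarser comparison than a tree kernel.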
Detecting logical argumentation in text via communicative discourse tree
Published in Journal of Experimental & Theoretical Artificial Intelligence, 2018
Boris Galitsky, Dmitry Ilvovsky, Sergey O. Kuznetsov
Tree kernel learning for strings, parse trees and parse thickets is now a well-established research area (Castellucci, Vanzo, Croce, & Basili, 2015). The parse tree kernel counts the number of common sub-trees as the discourse similarity measure between two instances. Wang et al. (2010) used a special form of tree kernel for discourse relation recognition. In this study, we extend the tree kernel definition to the communicative discourse tree (CDT), augmenting the discourse tree (DT) kernel with information on communicative actions. A CDT can be represented by a vector V of integer counts of each sub-tree type (without taking into account its ancestors). The terms for communicative actions, used as labels, are converted into trees that are added to the respective nodes for RST relations. For the texts of elementary discourse units (EDUs) that label terminal nodes, only the phrase structure is retained: we label the terminal nodes with the sequence of phrase types instead of parse tree fragments.
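The vector-of-counts representation can be sketched as follows. This is a simplified illustration, not the authors' implementation: it counts only complete sub-trees (a node together with all of its descendants), whereas tree kernels in the literature typically also count partial fragments. The tuple encoding, relation labels, and phrase-type sequences are hypothetical.

```python
from collections import Counter

# Assumed encoding: a tree is a nested tuple (label, child, child, ...).
# Terminal nodes carry a phrase-type sequence such as "NP VP" as a string.

def subtree_counts(tree):
    """Vector V: integer count of each sub-tree type, ignoring ancestors."""
    counts = Counter()

    def visit(node):
        if isinstance(node, tuple):
            counts[node] += 1  # the whole sub-tree rooted here is the key
            for child in node[1:]:
                visit(child)

    visit(tree)
    return counts

def count_kernel(t1, t2):
    """Dot product of the two count vectors (common sub-trees, with multiplicity)."""
    v1, v2 = subtree_counts(t1), subtree_counts(t2)
    return sum(v1[s] * v2[s] for s in v1.keys() & v2.keys())

# Hypothetical discourse trees sharing an Attribution sub-tree.
cdt1 = ("Elaboration",
        ("Attribution", ("EDU", "NP VP"), ("EDU", "NP VP")),
        ("EDU", "NP VP PP"))
cdt2 = ("Contrast",
        ("Attribution", ("EDU", "NP VP"), ("EDU", "NP VP")),
        ("EDU", "NP"))

print(count_kernel(cdt1, cdt2))  # 5: one Attribution match plus 2 x 2 EDU matches
```

Under this encoding, a communicative action could be attached as one more child tuple of a relation node, so it would contribute to the counts in the same way as any other sub-tree.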