Explore chapters and articles related to this topic
Data Lakes: A Panacea for Big Data Problems, Cyber Safety Issues, and Enterprise Security
Published in Mohiuddin Ahmed, Nour Moustafa, Abu Barkat, Paul Haskell-Dowland, Next-Generation Enterprise Security and Governance, 2022
A. N. M. Bazlur Rashid, Mohiuddin Ahmed, Abu Barkat Ullah
The raw datasets created by data ingestion are in specific data formats, such as binary or textual encodings. Data extraction transforms the raw data into a predefined data model for different purposes, for example, data discovery, cleaning, and integration. Data cleaning can be performed by CLAMS, which unifies heterogeneous data from the data lake into RDF. Similarly, table extraction can be used for data discovery: it abstracts the data into attributes that can be indexed, which allows efficient discovery. Data extraction tools, such as DeepDive, extract relational data from data lakes consisting of tables, texts, and images, based on user-defined schemas and rules. The Google Web Table project is an example of an automatic data extraction tool that combines statistically trained classifiers and hand-written heuristics. Google Web Table is used for detecting relational tables among HTML tables and assigning synthetic headers when required. For extracting relational data from semi-structured log files, DATAMARAN can be used. Declarative data description languages, such as PADS, also parse and extract data from files with the help of a compiler and tools.
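The idea of turning an HTML table into a relation and assigning synthetic headers when none are present can be illustrated with a minimal sketch. This is not the Google Web Table implementation: it uses a single hand-written heuristic (a first row made of `<th>` cells is treated as the header) and no trained classifier. The names `TableExtractor`, `extract_relation`, and the `col_N` header pattern are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <tr> row in an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []            # list of (row_has_th, [cell text, ...])
        self._cells = None        # cells of the row currently being parsed
        self._text = None         # text fragments of the current cell
        self._row_has_th = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells, self._row_has_th = [], False
        elif tag in ("td", "th") and self._cells is not None:
            self._text = []
            self._row_has_th = self._row_has_th or tag == "th"

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._text is not None:
            self._cells.append("".join(self._text).strip())
            self._text = None
        elif tag == "tr" and self._cells is not None:
            self.rows.append((self._row_has_th, self._cells))
            self._cells = None


def extract_relation(html):
    """Return (header, records); assign synthetic headers if none are found."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    rows = parser.rows
    if not rows:
        return [], []
    if rows[0][0]:
        # Heuristic (assumption): a first row containing <th> is the header.
        header = rows[0][1]
        body = [cells for _, cells in rows[1:]]
    else:
        # No header detected: generate synthetic column names.
        width = max(len(cells) for _, cells in rows)
        header = [f"col_{i}" for i in range(width)]
        body = [cells for _, cells in rows]
    return header, [dict(zip(header, cells)) for cells in body]
```

Calling `extract_relation` on a table without `<th>` cells yields records keyed by `col_0`, `col_1`, and so on, which can then be indexed like any other attribute.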
Modern Predictive Analytics and Big Data Systems Engineering
Published in Anna M. Doro-on, Handbook of Systems Engineering and Risk Management in Control Systems, Communication, Space Technology, Missile, Security and Defense Operations, 2023
Data extraction and data mining are often confused, but the two differ substantially. In general, data extraction means acquiring data from one data source and processing it into a designated database. Data extraction, also known as data scraping, usually involves retrieving data for further processing such as clustering or segmentation analysis (Mena 2011). It can also involve the extraction of unstructured data (text) for further processing via a structured type of analytical tool; this may involve some transformation and possibly the addition of metadata (Mena 2011). Typical unstructured data sources include web pages, emails, documents, PDFs, scanned text, mainframe reports, spool files, etc. (Mena 2011). Data can likewise be obtained from a legacy system and incorporated into a standard data warehouse. Data mining (often treated as synonymous with knowledge discovery) is the extraction of vague or concealed predictive information and the search for patterns in large data warehouses. The knowledge discovery process is an iterative sequence of the following steps (Han et al. 2012): (a) data cleaning, to remove noise and inconsistent data; (b) data integration, where multiple data sources may be combined; (c) data selection, where data relevant to the analysis task are retrieved from the database; (d) data mining, an essential process where intelligent methods are applied to extract data patterns; (e) pattern evaluation, to identify the truly interesting patterns representing knowledge based on interestingness measures; and (f) knowledge presentation, where visualization and knowledge representation techniques are used to present mined knowledge to users.
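Steps (a) to (f) can be sketched as a small pipeline. The example below is a toy illustration under stated assumptions, not a method from Han et al.: the record layout (`region`, `items`), the use of item-pair co-occurrence counting as the mining step, and support as the interestingness measure are all choices made here for illustration.

```python
from collections import Counter
from itertools import combinations


def clean(records):
    """(a) Data cleaning: drop records with a missing or empty item list."""
    return [r for r in records if r.get("items")]


def integrate(*sources):
    """(b) Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]


def select(records, region):
    """(c) Data selection: keep only records relevant to the task."""
    return [r for r in records if r.get("region") == region]


def mine(records):
    """(d) Data mining: count how often each pair of items co-occurs."""
    counts = Counter()
    for r in records:
        counts.update(combinations(sorted(set(r["items"])), 2))
    return counts


def evaluate(counts, n_records, min_support):
    """(e) Pattern evaluation: keep pairs whose support meets the threshold."""
    return {
        pair: count / n_records
        for pair, count in counts.items()
        if count / n_records >= min_support
    }


def present(patterns):
    """(f) Knowledge presentation: readable lines, highest support first."""
    ordered = sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{a} + {b}: support {s:.2f}" for (a, b), s in ordered]


def discover(sources, region, min_support):
    """Run steps (a) to (f) in order on hypothetical transaction records."""
    cleaned = [clean(source) for source in sources]
    selected = select(integrate(*cleaned), region)
    if not selected:
        return []
    return present(evaluate(mine(selected), len(selected), min_support))
```

In practice the process is iterative: the output of the evaluation step often leads back to a different selection or a different cleaning rule, so `discover` would typically be rerun with adjusted parameters.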
Analysis and comparison of turbulence models on wind turbine performance using SCADA data and machine learning technique
Published in Cogent Engineering, 2023
Jui-Hung Liu, Jien-Chen Chen, Nelson T. Corbita
Data extraction is the process of gathering or retrieving various types of data from a variety of sources, many of which are unstructured or poorly organized. Data extraction allows for the consolidation, processing, and refinement of data before it is stored in a centralized location and transformed. The SCADA system’s user interface could only provide historical data for the last two years.
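Gathering data from differently organized sources and consolidating it into one form can be sketched as follows. The field names (`time`, `wind_speed_ms`, `ts`, `ws`) and the two input formats are hypothetical; they are not taken from the SCADA system described in the article.

```python
import csv
import io
import json


def extract_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def extract_json(text):
    """Parse JSON text that holds a list of objects."""
    return json.loads(text)


def consolidate(csv_text, json_text):
    """Map both sources onto one schema and sort by ISO 8601 timestamp."""
    records = []
    for row in extract_csv(csv_text):
        records.append({
            "timestamp": row["time"],                  # hypothetical CSV column
            "wind_speed": float(row["wind_speed_ms"]), # hypothetical CSV column
            "source": "csv",
        })
    for row in extract_json(json_text):
        records.append({
            "timestamp": row["ts"],                    # hypothetical JSON key
            "wind_speed": float(row["ws"]),            # hypothetical JSON key
            "source": "json",
        })
    # ISO 8601 strings in the same format sort chronologically as text.
    return sorted(records, key=lambda r: r["timestamp"])
```

The consolidated list uses a single schema, so later refinement and storage steps do not need to know which source a record came from.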