Data Lakes: A Panacea for Big Data Problems, Cyber Safety Issues, and Enterprise Security
Published in Mohiuddin Ahmed, Nour Moustafa, Abu Barkat, Paul Haskell-Dowland, Next-Generation Enterprise Security and Governance, 2022
A. N. M. Bazlur Rashid, Mohiuddin Ahmed, Abu Barkat Ullah
Service providers typically use traditional approaches such as a data warehouse: a single repository used to store data, consolidate information, analyze data, and create reports. However, transferring data into a warehouse requires preprocessing. With zettabytes of data in cyberspace, this is not an easy task: preprocessing demands a substantial amount of computing by high-end supercomputers, which costs time and money. Data lakes were proposed to solve this issue. Unlike data warehouses, data lakes store raw data of any type. Both data warehouses and data lakes can be considered methods of storing and processing Big Data, but data lakes address the main Big Data challenges of storing, processing, and analyzing heterogeneous data sources, whether structured, semi-structured, or unstructured. Data privacy can also be incorporated into data lake models to ensure data security and privacy. Therefore, data lakes are often considered a panacea for Big Data problems, and many organizations embrace them in an attempt to drive innovation and new services for users. To address all these issues, Big Data, data warehouses, data lakes, related cyber safety issues, and enterprise security are discussed in this chapter.
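The contrast the chapter draws between a warehouse (preprocess before storing) and a lake (store raw, process later) is often described as schema-on-write versus schema-on-read. The following is a minimal sketch of that distinction; the stores, record fields, and function names are invented for illustration and are not from the chapter:

```python
import json

# Hypothetical in-memory stores illustrating the two approaches.
warehouse = []   # schema-on-write: rows validated and normalised before load
lake = []        # schema-on-read: raw payloads stored verbatim

def load_into_warehouse(raw_json: str) -> None:
    """Preprocess (parse, validate, normalise) BEFORE storing."""
    record = json.loads(raw_json)
    warehouse.append({
        "user_id": int(record["user_id"]),      # types enforced up front
        "event": str(record["event"]).lower(),
    })

def ingest_into_lake(raw_payload: str) -> None:
    """Store the payload as-is; no schema imposed on import."""
    lake.append(raw_payload)

def read_events_from_lake() -> list:
    """Schema-on-read: parse and clean only when the data are used."""
    rows = []
    for payload in lake:
        try:
            record = json.loads(payload)        # may be JSON ...
        except json.JSONDecodeError:
            continue                            # ... or something else entirely
        rows.append({"user_id": int(record["user_id"]),
                     "event": str(record["event"]).lower()})
    return rows

ingest_into_lake('{"user_id": "42", "event": "LOGIN"}')
ingest_into_lake("free-text log line, not JSON")  # a lake accepts this;
                                                  # a warehouse load would reject it
load_into_warehouse('{"user_id": "42", "event": "LOGIN"}')

print(read_events_from_lake())  # [{'user_id': 42, 'event': 'login'}]
print(warehouse)                # [{'user_id': 42, 'event': 'login'}]
```

The deferred-processing step in `read_events_from_lake` is exactly the cost the chapter notes: the preprocessing is not avoided, only postponed to read time.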
Track
Published in Walter R. Paczkowski, Deep Data Analytics for New Product Development, 2020
A data store is the storage location closest to the source. It is temporary storage before the data are cleansed and loaded into the data warehouse, which is a more encompassing and inclusive storage location for data. A data warehouse holds a wide variety of types of data: financial, personnel, transaction, and so forth, all, of course, organized by topical and functional areas. A data lake is a variation of a data warehouse in that, like a warehouse, it stores a variety of data, but these data are in their “native format” and can be structured or unstructured. They may come from social media, blogs, emails, sensors, and so forth. The costs of maintaining these data in a lake are lower than for a warehouse because the storage arrangements are less restrictive. The data lake, however, has other costs beyond those associated with maintenance. The primary cost is the level of preprocessing that has to be applied to data from a lake before they are used in an analytical process. Since the data in a lake are unprocessed, by definition, and direct from their source, they will first have to be processed, cleaned, checked, and wrangled (i.e., merged with other data) before they can be used in the analytical process. See Lemahieu et al. [2018] on processing costs and Kazil and Jarmul [2016] for insight into the concept and complexities of data wrangling.
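The process–clean–check–wrangle pipeline described above can be sketched as follows. This is a toy example with invented sensor data and an invented lookup table, not an excerpt from the cited works:

```python
import csv
import io

# Raw extract from the lake: native-format CSV with missing and untidy values.
raw_sensor_csv = """sensor_id,reading
s1, 21.5
s2,
s1,22.1
s3,19.8
"""

# Hypothetical reference data to merge in (the "wrangling" step).
sensor_locations = {"s1": "line-A", "s2": "line-B"}

def wrangle(raw_csv: str, locations: dict) -> list:
    """Process, clean, check, and merge raw lake data before analysis."""
    rows = []
    for rec in csv.DictReader(io.StringIO(raw_csv)):
        value = rec["reading"].strip()
        if not value:                      # clean: drop missing readings
            continue
        reading = float(value)             # check: fails loudly on bad values
        rows.append({
            "sensor_id": rec["sensor_id"],
            "reading": reading,
            # wrangle: merge location data; flag sensors missing from the lookup
            "location": locations.get(rec["sensor_id"], "unknown"),
        })
    return rows

clean = wrangle(raw_sensor_csv, sensor_locations)
print(clean)
```

Every one of these steps is work that a warehouse would have performed once at load time; with a lake it is paid by each analytical use.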
Cloud Computing, Data Sources and Data Centers
Published in Diego Galar Pascual, Pasquale Daponte, Uday Kumar, Handbook of Industry 4.0 and SMART Systems, 2019
Diego Galar Pascual, Pasquale Daponte, Uday Kumar
The term data repository can be used to describe several ways to collect and store data:
- A data warehouse is a large data repository that aggregates data, usually from multiple sources or segments of a business, without the data necessarily being related.
- A data lake is a large data repository that stores unstructured data that are classified and tagged with metadata.
- Data marts are subsets of the data repository. These data marts are more targeted to what the data user needs and easier to use. Data marts are also more secure because they limit authorized users to isolated data sets; these users cannot access all the data in the data repository.
- Metadata repositories store data about data and databases. The metadata explain where the data source is, how it was captured, and what it represents.
- Data cubes are lists of data with three or more dimensions stored as a table, as you may find in a spreadsheet.
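A data cube, as described above, is a measure indexed by three or more dimensions but stored as flat table rows. The sketch below uses invented product/region/month data to show how such a table can be "rolled up" along any one dimension:

```python
# Toy data cube: a measure (units sold) indexed by three dimensions
# (product, region, month), stored as flat rows as in a spreadsheet.
# All names and figures are invented for illustration.
cube_rows = [
    ("widget", "north", "Jan", 10),
    ("widget", "north", "Feb", 12),
    ("widget", "south", "Jan", 7),
    ("gadget", "north", "Jan", 5),
]

def roll_up(rows, dim_index):
    """Aggregate the measure over all dimensions except the chosen one."""
    totals = {}
    for row in rows:
        key = row[dim_index]
        totals[key] = totals.get(key, 0) + row[3]
    return totals

print(roll_up(cube_rows, 0))  # by product: {'widget': 29, 'gadget': 5}
print(roll_up(cube_rows, 1))  # by region:  {'north': 27, 'south': 7}
print(roll_up(cube_rows, 2))  # by month:   {'Jan': 22, 'Feb': 12}
```

The same flat-rows-plus-aggregation idea underlies pivot tables and OLAP roll-up operations.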
Vehicle system dynamics in digital twin studies in rail and road domains
Published in Vehicle System Dynamics, 2023
Maksym Spiryagin, Johannes Edelmann, Florian Klinger, Colin Cole
A framework tailored in particular to connecting multiple mobility entities is presented in [57]. The authors describe their so-called ‘Mobility Digital Twin’ as an ‘Artificial Intelligence (AI)-based data-driven cloud-edge-device framework for mobility services’. The framework basically comprises ‘Human’, ‘Vehicle’, and ‘Traffic’ entities in both the physical space and their associated digital twins in the digital space, as well as the communication between the entities, e.g. by using Internet of Things (IoT) and/or Internet of Vehicles (IoV) technologies [59,60]. A ‘data lake’ is considered in the digital space as ‘a centralized repository that allows structured or unstructured data at any scale to be stored’. The resulting data can be used for micro-services for an individual entity, but also for ‘cooperative control’ or ‘parallel control’ of multiple entities [61]. An example implementation of the framework with cloud-edge computing is discussed in detail to provide an idea of deploying the digital twin concept in the real world, building on Amazon Web Services (AWS) and other (commercial) software tools. The four-layer architecture is shown in Figure 8.
Advances in the UK Toward NDE 4.0
Published in Research in Nondestructive Evaluation, 2020
N. Brierley, R. A. Smith, N. Turner, R. Culver, T. Maw, A. Holloway, O. Jones, P. D. Wilcox
In particular, an environment that enables rapid process development must make full use of modelling, simulation and the vast quantities of data available. Extensive connectivity and efficient data handling are required to expose the data to users and applications. The lack of any all-encompassing data model applicable across the whole AM process chain has led to the implementation of a data lake [11]. The data lake is a repository of data stored in a native state, where, alongside traditional database rows and columns, semi-structured data (for example CSV or JSON files) and unstructured data (such as PDFs and images) can be stored. Data from across the process chain (build files, build logs, powder test data, inspection data, inspection reports, operator production data packs etc.) are captured and input to the data lake through a single interface. No schema or format is imposed on the data upon import; the only requirement is that context is attached to the data so that it is clear where the data have come from, which part they relate to, how and when they were generated, and who the owner is. This enables easy identification of raw data related to a given component by a simple search against these context data entries. While no strict schema compliance is expected for the raw data, a schema is imposed on the context data, to ensure consistency.
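The import rule described above — no schema on the raw data, but a fixed schema on the attached context — can be sketched as follows. The context field names, store, and search function are illustrative assumptions, not the schema used by the authors:

```python
# Schema imposed only on the context metadata, never on the raw payload.
# Field names here are hypothetical, chosen to mirror the requirements in
# the text: provenance, related part, generation time, and owner.
REQUIRED_CONTEXT = {"source", "part_id", "generated_at", "owner"}

data_lake = []  # each entry: (context dict, raw payload stored untouched)

def ingest(raw, context: dict) -> None:
    """Accept any raw data, but reject imports with incomplete context."""
    missing = REQUIRED_CONTEXT - context.keys()
    if missing:
        raise ValueError(f"context schema violated, missing: {sorted(missing)}")
    data_lake.append((context, raw))

def find_by_part(part_id: str) -> list:
    """Search the consistent context entries to locate raw data for a part."""
    return [raw for ctx, raw in data_lake if ctx["part_id"] == part_id]

# Unstructured and semi-structured payloads ingest equally well:
ingest(b"%PDF-1.4 ...", {"source": "inspection report", "part_id": "P-100",
                         "generated_at": "2020-03-01", "owner": "NDE team"})
ingest("layer,power\n1,195\n", {"source": "build log", "part_id": "P-100",
                                "generated_at": "2020-03-02", "owner": "AM cell"})

print(len(find_by_part("P-100")))  # 2
```

Enforcing consistency only on the context keeps import friction low while still making every raw artefact discoverable by component.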