Rapidly load data from anywhere, in any format, at any time
Anzo Smart Data Lake (ASDL) connects to both internal and external data sources – including cloud or on-premises Hadoop-based data lakes – to rapidly ingest and catalog large volumes of structured and unstructured data through horizontally scaled, automated Extract, Transform and Load (ETL) processes that can be mapped to establish a Semantic Layer of business meaning…
Able to sustain extremely high parallel load rates, ASDL adds enormous amounts of rich data to Enterprise Knowledge Graphs or tabular targets in just minutes. ASDL ingests most structured data without manual mapping, automatically creating a graph model from the data structure or logical model – enriched by any available metadata, data dictionaries or taxonomies. Collaborative mapping capabilities let analysts create additional transformations and add business meaning and relationships during data movement. ASDL can also ingest to non-graph targets – using the Semantic Layer as a business-understandable canonical model in a virtual hub-and-spoke ETL to move and transform data between data environments such as data warehouses or Apache Hive. Point-to-point Apache Spark jobs that require no programming are generated automatically from reusable mappings between the sources, the targets and the Semantic Layer.
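As a rough illustration of what deriving a graph model from a tabular structure can look like, the Python sketch below turns each row into a typed node and each non-key column into a property edge. It is not ASDL's implementation: the function name `rows_to_triples`, the `http://example.org/` base IRI and the sample `customer` table are all hypothetical, chosen only to show the idea.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]


def rows_to_triples(table: str, key: str,
                    rows: List[Dict[str, Optional[str]]],
                    base: str = "http://example.org/") -> List[Triple]:
    """Map tabular rows to (subject, predicate, object) triples.

    Hypothetical helper: the table name becomes a class, the key column
    identifies each node, and every other non-empty column becomes a property.
    """
    triples: List[Triple] = []
    class_iri = f"{base}{table}"
    for row in rows:
        subject = f"{base}{table}/{row[key]}"
        triples.append((subject, "rdf:type", class_iri))
        for column, value in row.items():
            # Skip the key itself and empty cells; they add no edge.
            if column == key or value is None or value == "":
                continue
            triples.append((subject, f"{base}{table}#{column}", value))
    return triples


# Made-up sample data for illustration only.
customers = [
    {"id": "1", "name": "Ada", "country": "UK"},
    {"id": "2", "name": "Lin", "country": ""},
]
triples = rows_to_triples("customer", "id", customers)
for triple in triples:
    print(triple)
```

Because the structure comes from the table itself, no hand-written mapping is needed for this first pass; business meaning and cross-table relationships would be layered on afterwards.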
Unlike other approaches that only flatten data for the benefit of Big Data tools, ASDL's ingestion capabilities preserve the multi-dimensional data model sourced from upstream applications and relational databases. Unstructured data is processed in parallel through configurable text analytics and Natural Language Processing (NLP) pipelines and harmonized with data from multiple sources in the knowledge graph.
Ingested graph data or target tables come to rest in scalable shared file storage – HDFS, cloud buckets, NFS or Apache Hive. Virtual data sets are another attractive capability for organizations wary of duplicating data: they pull data on demand from its sources and ingest it directly into memory for analytics, only as needed.
From data source to dashboard, users can trace the full provenance and lineage of all data in the catalog in a user-friendly, visual interface. As a precursor to ingestion, ASDL automatically captures extensive schema and statistical metadata describing data sources right down to the field level. This information gives data preparers and data consumers context when searching the data catalog or during analytics.
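To make field-level schema and statistical metadata concrete, here is a minimal Python sketch that profiles a handful of records. It is an assumption-laden illustration rather than a description of how ASDL profiles sources: the function `profile_fields`, the chosen statistics (observed types, null count, distinct count) and the sample records are all hypothetical.

```python
from typing import Any, Dict, List


def profile_fields(rows: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Collect simple per-field metadata from a list of records.

    Hypothetical helper: assumes field values are hashable (or None).
    """
    profile: Dict[str, Dict[str, Any]] = {}
    for row in rows:
        for field, value in row.items():
            stats = profile.setdefault(
                field, {"types": set(), "nulls": 0, "values": set()}
            )
            if value is None:
                stats["nulls"] += 1
            else:
                stats["types"].add(type(value).__name__)
                stats["values"].add(value)
    # Reduce the working sets to a compact, catalog-friendly summary.
    return {
        field: {
            "types": sorted(stats["types"]),
            "nulls": stats["nulls"],
            "distinct": len(stats["values"]),
        }
        for field, stats in profile.items()
    }


# Made-up sample records for illustration only.
sample = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "a@example.com"},
]
metadata = profile_fields(sample)
for field, summary in metadata.items():
    print(field, summary)
```

A summary like this, captured before ingestion, is the kind of context a catalog can surface next to each field.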
Because ASDL is an open data platform, driven by open-standards-based models, organizations can extensively augment the kinds of metadata they capture to more fully describe their sources of data in terms that make sense to them and their business- or regulatory-driven processes. This can be done either manually or through automated means such as Machine Learning or the ingestion of existing metadata sources, capturing any additional information that might be useful to data preparers and data consumers. Additional metadata might include quality or completeness metrics, flags for PII and GDPR, references to applicable regulations like HIPAA, data usage metrics, collaborative information, confidentiality level indicators, and upstream data steward contact details.
All captured data source metadata is recorded in ASDL’s metadata graph store, so it can be analyzed on its own or in combination with any data in the data lake.