Lightning Catalog is a fast, lightweight, and intuitive Spark-based data catalog for preparing data at any scale for ad-hoc analytics, data warehouse, lakehouse, and ML projects.
Move data from source and legacy systems to the target state while the business keeps running.
A single view of all your data, transformed into one unified semantic model and business language.
Enable Data Science Workloads
Lightning Catalog can take the burden of data preparation off ML engineers and help them focus on building models.
Simplified Pipeline Execution Engine
Lightning Catalog simplifies the life cycle of a data engineering pipeline (build, test, and deploy) by leveraging the Data Flow Table.
Discover and register all your source metadata information
Accelerate your data transformations using basic SQL queries
Distribute data by connecting upstream to downstream via secure JDBC/ODBC Connections
Fully managed catalog built on file systems (HDFS, Blob storage, and local files), which allows version control.
Support Apache Spark Plug-in architecture.
Support running data pipelines at MPP scale by leveraging Apache Spark and, optionally, NVIDIA GPUs.
Support running ANSI SQL and HiveQL over source systems defined in the catalog.
Support multiple namespaces.
Support data quality by integrating Amazon Deequ.
Support Data Flow Table, a declarative ETL framework that defines and transforms your data.
Support metadata processing for unstructured data using endpoint declarations.
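
To make the idea of a file-system-backed catalog with multiple namespaces more concrete, here is a minimal Python sketch. It is not Lightning Catalog's implementation or on-disk format: the `FileCatalog` class, its methods, the JSON layout, and the namespace and source names are all hypothetical, chosen only to illustrate how keeping one plain file per registered source lets an ordinary version control tool track catalog changes.

```python
import json
import tempfile
from pathlib import Path


class FileCatalog:
    """Toy file-backed catalog (hypothetical): one directory per namespace,
    one JSON file per registered data source."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _namespace_dir(self, namespace):
        # A dotted namespace such as "sales.rdbms" maps to <root>/sales/rdbms.
        return self.root.joinpath(*namespace.split("."))

    def register(self, namespace, name, fmt, options):
        """Write the source definition to <namespace dir>/<name>.json."""
        ns_dir = self._namespace_dir(namespace)
        ns_dir.mkdir(parents=True, exist_ok=True)
        entry = {"name": name, "format": fmt, "options": options}
        path = ns_dir / f"{name}.json"
        # sort_keys keeps the output stable, so diffs in version control stay small.
        path.write_text(json.dumps(entry, indent=2, sort_keys=True))
        return path

    def load(self, namespace, name):
        """Read a source definition back as a dict."""
        path = self._namespace_dir(namespace) / f"{name}.json"
        return json.loads(path.read_text())

    def list_sources(self, namespace):
        """Return the sorted names of sources registered directly in a namespace."""
        ns_dir = self._namespace_dir(namespace)
        if not ns_dir.is_dir():
            return []
        return sorted(p.stem for p in ns_dir.glob("*.json"))


def demo():
    # All namespaces, source names, URLs, and paths below are made up for illustration.
    with tempfile.TemporaryDirectory() as tmp:
        catalog = FileCatalog(tmp)
        catalog.register("sales.rdbms", "orders_db", "jdbc",
                         {"url": "jdbc:postgresql://localhost:5432/orders"})
        catalog.register("sales.files", "events", "parquet",
                         {"path": "/data/events"})
        sources = catalog.list_sources("sales.rdbms")
        fmt = catalog.load("sales.files", "events")["format"]
        return sources, fmt


if __name__ == "__main__":
    print(demo())
```

Because each definition is a small text file in a predictable location, placing the catalog root under a tool such as Git would give history and review of metadata changes without any extra service. A real catalog would also need concurrency control and credential handling, which this sketch leaves out.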
Latest Supported Data Sources
... and many other compatible data sources
Spark JDBC (PostgreSQL, MySQL), Spark MongoDB, Spark Azure Blob, BigQuery, Spark Hive/Glue, Spark REST API, Spark XML, Spark Metastore/Catalog, Spark data lake (Iceberg, Delta), Spark access control/authentication, Spark data migration, Spark pipeline