Based on the AWESOME repository.
Actian Ingres – commercially supported, open-source SQL relational database management system.
ActorDB – a distributed SQL database with the scalability of a KV store, while keeping the query capabilities of a relational database.
Amazon RedShift – data warehouse service, based on PostgreSQL.
BayesDB – statistics-oriented SQL database.
Bedrock – a simple, modular, networked and distributed transaction layer built atop SQLite.
CitusDB – scales out PostgreSQL through sharding and replication.
Cockroach – Scalable, Geo-Replicated, Transactional Datastore.
Comdb2 – a clustered RDBMS built on optimistic concurrency control techniques.
Datomic – distributed database designed to enable scalable, flexible and intelligent applications.
FoundationDB – distributed database, inspired by F1.
Google F1 – distributed SQL database built on Spanner.
Google Spanner – globally distributed semi-relational database.
H-Store – an experimental main-memory, parallel database management system optimized for on-line transaction processing (OLTP) applications.
Haeinsa – linearly scalable multi-row, multi-table transaction library for HBase based on Percolator.
HandlerSocket – NoSQL plugin for MySQL/MariaDB.
InfiniSQL – infinitely scalable RDBMS.
KarelDB – a relational database backed by Apache Kafka.
Map-D – GPU in-memory database, big data analysis and visualization platform.
MemSQL – in-memory SQL database with optimized columnar storage on flash.
NuoDB – SQL/ACID compliant distributed database.
Oracle TimesTen in-Memory Database – in-memory, relational database management system with persistence and recoverability.
Pivotal GemFire XD – Low-latency, in-memory, distributed SQL data store. Provides SQL interface to in-memory table data, persistable in HDFS.
SAP HANA – an in-memory, column-oriented, relational database management system.
SenseiDB – distributed, realtime, semi-structured database.
Sky – database used for flexible, high performance analysis of behavioral data.
SymmetricDS – open source software for both file and database synchronization.
TiDB – a distributed SQL database inspired by the design of Google F1.
VoltDB – claims to be the fastest in-memory database.
YugabyteDB – open source, high-performance, distributed SQL database compatible with PostgreSQL.
Axibase Time Series Database – Integrated time series database on top of HBase with built-in visualization, rule-engine and SQL support.
Chronix – time series storage built for high compression and fast access times.
Cube – uses MongoDB to store time series data.
Heroic – a scalable time series database based on Cassandra and Elasticsearch.
InfluxDB – a time series database with optimised IO and queries, supports pgsql and influx wire protocols.
QuestDB – high-performance, open-source SQL database for applications in financial services, IoT, machine learning, DevOps and observability.
IronDB – scalable, general-purpose time series database.
KairosDB – similar to OpenTSDB but also supports Cassandra.
M3DB – a distributed time series database that can be used for storing realtime metrics at long retention.
Newts – a time series database based on Apache Cassandra.
TDengine – a time series database written in C that uses characteristics of IoT data to improve read/write throughput and reduce the space needed to store data.
OpenTSDB – distributed time series database on top of HBase.
Prometheus – a time series database and service monitoring system.
Beringei – Facebook’s in-memory time-series database.
TrailDB – an efficient tool for storing and querying series of events.
Druid – Column oriented distributed data store ideal for powering interactive applications
Riak-TS – an enterprise-grade NoSQL time series database optimized for IoT and time series data.
Akumuli – a numeric time-series database that can capture, store and process time-series data in real time. The word “akumuli” can be translated from Esperanto as “accumulate”.
Rhombus – A time-series object store for Cassandra that handles all the complexity of building wide row indexes.
DalmatinerDB – fast distributed metrics database.
Blueflood – A distributed system designed to ingest and process time series data
Timely – a time series database application that provides secure access to time series data, based on Accumulo and Grafana.
SiriDB – Highly-scalable, robust and fast, open source time series database with cluster functionality.
Thanos – a set of components for creating a highly available metric system with unlimited storage capacity using multiple (existing) Prometheus deployments.
VictoriaMetrics – fast, scalable and resource-effective open-source TSDB compatible with Prometheus. Single-node and cluster versions included
Actian SQL for Hadoop – high performance interactive SQL access to all Hadoop data.
Apache Drill – framework for interactive analysis, inspired by Dremel.
Apache HCatalog – table and storage management layer for Hadoop.
Apache Hive – SQL-like data warehouse system for Hadoop.
Apache Calcite – framework that allows efficient translation of queries involving heterogeneous and federated data.
Apache Phoenix – SQL skin over HBase.
Aster Database – SQL-like analytic processing for MapReduce.
Cloudera Impala – framework for interactive analysis, inspired by Dremel.
Concurrent Lingual – SQL-like query language for Cascading.
Datasalt Splout SQL – full SQL query engine for big datasets.
Dremio – an open-source, SQL-like Data-as-a-Service Platform based on Apache Arrow.
Facebook PrestoDB – distributed SQL query engine.
Google BigQuery – framework for interactive analysis, implementation of Dremel.
Materialize – a streaming database for real-time applications that uses SQL for queries and supports a large fraction of PostgreSQL.
Invantive SQL – SQL engine for online and on-premise use with integrated local data replication and 70+ connectors.
PipelineDB – an open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables.
Pivotal HDB – SQL-like data warehouse system for Hadoop.
RainstorDB – database for storing petabyte-scale volumes of structured and semi-structured data.
Spark Catalyst – a query optimization framework for Spark and Shark.
SparkSQL – Manipulating Structured Data Using Spark.
Splice Machine – a full-featured SQL-on-Hadoop RDBMS with ACID transactions.
Stinger – interactive query for Hive.
Tajo – distributed data warehouse system on Hadoop.
Trafodion – enterprise-class SQL-on-HBase solution targeting big data transactional or operational workloads.
redpanda – A Kafka® replacement for mission critical systems; 10x faster. Written in C++.
Amazon Kinesis – real-time processing of streaming data at massive scale.
Amazon Web Services Glue – serverless fully managed extract, transform, and load (ETL) service
Census – a reverse ETL product that lets you sync data from your data warehouse to SaaS applications. No engineering favors required, just SQL.
Apache Chukwa – data collection system.
Apache Flume – service to manage large amounts of log data.
Apache Kafka – distributed publish-subscribe messaging system.
Apache NiFi – an integrated data logistics platform for automating the movement of data between disparate systems.
Apache Pulsar – a distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API.
Apache Sqoop – tool to transfer data between Hadoop and a structured datastore.
Embulk – open-source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services.
Facebook Scribe – streamed log data aggregator.
Fluentd – tool to collect events and logs.
Gazette – Distributed streaming infrastructure built on cloud storage which makes it easy to mix and match batch and streaming paradigms.
Google Photon – geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.
Heka – open source stream processing software system.
HIHO – framework for connecting disparate data sources with Hadoop.
Kestrel – distributed message queue system.
LinkedIn Databus – stream of change capture events for a database.
LinkedIn Kamikaze – utility package for compressing sorted integer arrays.
LinkedIn White Elephant – log aggregator and dashboard.
Logstash – a tool for managing events and logs.
Netflix Suro – log aggregator like Storm and Samza, based on Chukwa.
Pinterest Secor – a service implementing Kafka log persistence.
LinkedIn Gobblin – LinkedIn’s universal data ingestion framework.
Skizze – sketch data store to deal with all problems around counting and sketching using probabilistic data-structures.
StreamSets Data Collector – continuous big data ingest infrastructure with a simple to use IDE.
Alooma – data pipeline as a service enabling moving data sources such as MySQL into data warehouses.
RudderStack – an open source customer data infrastructure (Segment, mParticle alternative) written in Go.
Akka Toolkit – runtime for distributed, fault-tolerant, event-driven applications on the JVM.
Apache Avro – data serialization system.
Apache Curator – Java libraries for Apache ZooKeeper.
Apache Karaf – OSGi runtime that runs on top of any OSGi framework.
Apache Thrift – framework to build binary protocols.
Apache ZooKeeper – centralized service for process management.
Google Chubby – a lock service for loosely-coupled distributed systems.
Hydrosphere Mist – a service for exposing Apache Spark analytics jobs and machine learning models as realtime, batch or reactive web services.
LinkedIn Norbert – cluster manager.
Mara – A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
OpenMPI – message passing framework.
Serf – decentralized solution for service discovery and orchestration.
Spotify Luigi – a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
Spring XD – distributed and extensible system for data ingestion, real time analytics, batch processing, and data export.
Twitter Elephant Bird – libraries for working with LZOP-compressed data.
Twitter Finagle – asynchronous network stack for the JVM.
Apache Airflow – a platform to programmatically author, schedule and monitor workflows.
Apache Aurora – a service scheduler that runs on top of Apache Mesos.
Apache Falcon – data management framework.
Apache Oozie – workflow job scheduler.
Azure Data Factory – cloud-based pipeline orchestration for on-prem, cloud and HDInsight
Chronos – distributed and fault-tolerant scheduler.
Cronicle – Distributed, easy to install, NodeJS based, task scheduler
Dagster – a data orchestrator for machine learning, analytics, and ETL.
LinkedIn Azkaban – batch workflow job scheduler.
Schedoscope – Scala DSL for agile scheduling of Hadoop jobs.
Sparrow – scheduling platform.
Azure ML Studio – Cloud-based AzureML, R, Python Machine Learning platform
Oryx – Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning.
Concurrent Pattern – machine learning library for Cascading.
DataVec – A vectorization and data preprocessing library for deep learning in Java and Scala. Part of the Deeplearning4j ecosystem.
Deeplearning4j – Fast, open deep learning for the JVM (Java, Scala, Clojure). A neural network configuration layer powered by a C++ library. Uses Spark and Hadoop to train nets on multiple GPUs and CPUs.
Decider – Flexible and Extensible Machine Learning in Ruby.
ENCOG – machine learning framework that supports a variety of advanced algorithms, as well as support classes to normalize and process data.
etcML – text classification with machine learning.
Etsy Conjecture – scalable Machine Learning in Scalding.
Feast – A feature store for the management, discovery, and access of machine learning features. Feast provides a consistent view of feature data for both model training and model serving.
GraphLab Create – A machine learning platform in Python with a broad collection of ML toolkits, data engineering, and deployment tools.
H2O – statistical, machine learning and math runtime with Hadoop. R and Python.
Karate Club – An unsupervised machine learning library for graph structured data. Python
Keras – An intuitive neural net API inspired by Torch that runs atop Theano and TensorFlow.
Lambdo – a workflow engine which significantly simplifies the analysis process by unifying feature engineering and machine learning operations.
Little Ball of Fur – A subsampling library for graph structured data. Python
Mahout – An Apache-backed machine learning library for Hadoop.
MLbase – distributed machine learning libraries for the BDAS stack.
MLPNeuralNet – Fast multilayer perceptron neural network library for iOS and Mac OS X.
ML Workspace – All-in-one web-based IDE specialized for machine learning and data science.
MOA – performs big data stream mining in real time and large-scale machine learning.
MonkeyLearn – Text mining made easy. Extract and classify data from text.
ND4J – A matrix library for the JVM. Numpy for Java.
nupic – Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.
PredictionIO – machine learning server built on Hadoop, Mahout and Cascading.
PyTorch Geometric Temporal – a temporal extension library for PyTorch Geometric.
RL4J – Reinforcement learning for Java and Scala. Includes Deep-Q learning and A3C algorithms, and integrates with OpenAI’s Gym. Runs in the Deeplearning4j ecosystem.
SAMOA – distributed streaming machine learning framework.
scikit-learn – machine learning in Python.
Shapley – A data-driven framework to quantify the value of classifiers in a machine learning ensemble.
Spark MLlib – a Spark implementation of some common machine learning (ML) functionality.
Sibyl – System for Large Scale Machine Learning at Google.
TensorFlow – Library from Google for machine learning using data flow graphs.
Theano – A Python-focused machine learning library supported by the University of Montreal.
Torch – A deep learning library with a Lua API, supported by NYU and Facebook.
Velox – System for serving machine learning predictions.
Vowpal Wabbit – learning system sponsored by Microsoft and Yahoo!.
WEKA – suite of machine learning software.
BIDMach – CPU and GPU-accelerated Machine Learning Library.
Apache Hadoop Benchmarking – micro-benchmarks for testing Hadoop performance.
Berkeley SWIM Benchmark – real-world big data workload benchmark.
Intel HiBench – a Hadoop benchmark suite.
PUMA Benchmarking – benchmark suite for MapReduce applications.
Yahoo Gridmix3 – Hadoop cluster benchmarking from the Yahoo engineering team.
Apache Ranger – Central security admin & fine-grained authorization for Hadoop
Apache Eagle – real time monitoring solution
Apache Knox Gateway – single point of secure access for Hadoop clusters.
Apache Sentry – security module for data stored in Hadoop.
BDA – The vulnerability detector for Hadoop and Spark
Apache Ambari – operational framework for Hadoop management.
Apache Bigtop – system deployment framework for the Hadoop ecosystem.
Apache Helix – cluster management framework.
Apache Mesos – cluster manager.
Apache Slider – a YARN application to deploy existing distributed applications on YARN.
Apache Whirr – set of libraries for running cloud services.
Apache YARN – Cluster manager.
Brooklyn – library that simplifies application deployment and management.
Buildoop – Similar to Apache BigTop based on Groovy language.
Cloudera HUE – web application for interacting with Hadoop.
Facebook Prism – multi-datacenter replication system.
Google Borg – job scheduling and monitoring system.
Google Omega – job scheduling and monitoring system.
Hortonworks HOYA – application that can deploy HBase cluster on YARN.
Kubernetes – a system for automating deployment, scaling, and management of containerized applications.
Marathon – Mesos framework for long-running services.
Linkis – helps connect easily to various back-end computation/storage engines.
411 – a web application for managing alerts generated by scheduled searches in Elasticsearch.
Adobe spindle – Next-generation web analytics processing with Scala, Spark, and Parquet.
Apache Metron – a platform that integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis.
Apache Nutch – open source web crawler.
Apache OODT – capturing, processing and sharing of data for NASA’s scientific archives.
Apache Tika – content analysis toolkit.
Argus – Time series monitoring and alerting platform.
AthenaX – a streaming analytics platform that enables users to run production-quality, large scale streaming analytics using Structured Query Language (SQL).
Atlas – a backend for managing dimensional time series data.
Countly – open source mobile and web analytics platform, based on Node.js & MongoDB.
Domino – Run, scale, share, and deploy models — without any infrastructure.
Eclipse BIRT – Eclipse-based reporting system.
ElastAlert – a simple framework for alerting on anomalies, spikes, or other patterns of interest from data in Elasticsearch.
Eventhub – open source event analytics platform.
HASH – open source simulation and visualization platform.
Hermes – asynchronous message broker built on top of Kafka.
Hunk – Splunk analytics for Hadoop.
Imhotep – Large scale analytics platform by Indeed.
Indicative – Web & mobile analytics tool, with data warehouse (AWS, BigQuery) integration.
Jupyter – Notebook and project application for interactive data science and scientific computing across all programming languages.
MADlib – data-processing library of an RDBMS to analyze data.
Kapacitor – an open source framework for processing, monitoring, and alerting on time series data.
Kylin – open source Distributed Analytics Engine from eBay.
PivotalR – R on Pivotal HD / HAWQ and PostgreSQL.
Rakam – open-source real-time custom analytics platform powered by PostgreSQL, Kinesis and PrestoDB.
Qubole – auto-scaling Hadoop cluster, built-in data connectors.
SnappyData – a distributed in-memory data store for real-time operational analytics, delivering stream analytics, OLTP (online transaction processing) and OLAP (online analytical processing) built on Spark in a single integrated cluster.
Snowplow – enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.
SparkR – R frontend for Spark.
Splunk – analyzer for machine-generated data.
Sumo Logic – cloud based analyzer for machine-generated data.
Talend – unified open source environment for YARN, Hadoop, HBase, Hive, HCatalog & Pig.