A Short Dictionary of Data Engineering Tools
https://druid.apache.org/
Apache Druid is a high performance real-time analytics database.
https://pinot.apache.org/
Realtime distributed OLAP datastore, designed to answer OLAP queries with low latency
https://spark.apache.org/
https://cloud.google.com/learn/what-is-apache-spark
Apache Spark™ is a unified analytics engine for large-scale data processing.
https://kafka.apache.org/
https://www.confluent.io/what-is-apache-kafka
Apache Kafka is a community distributed event streaming platform capable of handling trillions of events a day. Initially conceived as a messaging queue, Kafka is based on an abstraction of a distributed commit log. Since being created and open sourced by LinkedIn in 2011, Kafka has quickly evolved from messaging queue to a full-fledged event streaming platform.
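The "distributed commit log" abstraction can be illustrated with a toy, single-process sketch. This is not Kafka's API: the `CommitLog` class, its `append` and `poll` methods, and the consumer names are all hypothetical, and there is no partitioning, replication, or persistence. It only shows the core idea that records are appended in order and each consumer tracks its own read offset.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CommitLog:
    """Toy single-partition log: records are only appended, never modified."""

    records: List[str] = field(default_factory=list)
    # Hypothetical per-consumer read positions (an assumption of this sketch).
    offsets: Dict[str, int] = field(default_factory=dict)

    def append(self, record: str) -> int:
        """Append a record and return its offset (its position in the log)."""
        self.records.append(record)
        return len(self.records) - 1

    def poll(self, consumer: str, max_records: int = 10) -> List[str]:
        """Return unread records for this consumer and advance its offset."""
        start = self.offsets.get(consumer, 0)
        batch = self.records[start:start + max_records]
        self.offsets[consumer] = start + len(batch)
        return batch


log = CommitLog()
for event in ["signup", "login", "purchase"]:
    log.append(event)

# Two independent consumers read the same log at their own pace.
first_read = log.poll("billing", max_records=2)
second_read = log.poll("billing")
audit_read = log.poll("audit")
```

Because reading never removes records, the "audit" consumer still sees every event after "billing" has consumed them, which is the main difference from a traditional message queue.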
http://samza.apache.org/
A distributed stream processing framework
Samza allows you to build stateful applications that process data in real-time from multiple sources including Apache Kafka.
Kafka Connect is a free, open-source component of Apache Kafka® that works as a centralized data hub for simple data integration between databases, key-value stores, search indexes, and file systems.
https://flink.apache.org/
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale
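A minimal sketch of what "stateful computation over unbounded and bounded streams" means, in plain Python rather than Flink's API. The function names `running_counts` and `click_stream` are hypothetical; a real engine would also distribute the state and checkpoint it for fault tolerance, which this sketch does not attempt.

```python
from collections import defaultdict
from itertools import islice
from typing import Dict, Iterable, Iterator, Tuple


def running_counts(events: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit (key, count so far) per event, keeping per-key state between events."""
    state: Dict[str, int] = defaultdict(int)
    for key in events:
        state[key] += 1
        yield key, state[key]


def click_stream() -> Iterator[str]:
    """Hypothetical unbounded source: cycles forever over a few page names."""
    while True:
        yield from ("home", "cart", "home")


# Bounded input: the computation ends when the data ends.
bounded = list(running_counts(["home", "cart", "home"]))

# Unbounded input: the same operator runs forever, so only a prefix is taken.
unbounded_prefix = list(islice(running_counts(click_stream()), 5))
```

The same operator handles both cases; the only difference is whether the input ever ends.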
https://storm.apache.org/
Apache Storm is a free and open source distributed realtime computation system. Apache Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Apache Storm is simple, can be used with any programming language, and is a lot of fun to use!
Apache Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Apache Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
https://beam.apache.org/
An advanced unified programming model
Implement batch and streaming data processing jobs that run on any execution engine.
https://superset.apache.org
Apache Superset is a modern data exploration and visualization platform
https://hive.apache.org/
The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.
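"Structure can be projected onto data already in storage" means the schema is applied when the data is read, not when it is written. The sketch below illustrates that idea with the Python standard library only; it does not use Hive. The sample data, the `SCHEMA` mapping, and `read_with_schema` are hypothetical names chosen for illustration.

```python
import csv
import io
from typing import Callable, Dict, List

# Raw text as it might sit in distributed storage; it carries no type information.
RAW_FILE = "1,alice,30.5\n2,bob,12.0\n3,carol,7.25\n"

# Hypothetical schema declared after the data was written: column name -> parser.
SCHEMA: Dict[str, Callable[[str], object]] = {
    "id": int,
    "name": str,
    "amount": float,
}


def read_with_schema(raw: str, schema: Dict[str, Callable[[str], object]]) -> List[dict]:
    """Parse each stored line and cast its fields according to the schema."""
    rows = []
    for fields in csv.reader(io.StringIO(raw)):
        rows.append(
            {col: cast(value) for (col, cast), value in zip(schema.items(), fields)}
        )
    return rows


rows = read_with_schema(RAW_FILE, SCHEMA)

# Rough equivalent of: SELECT name FROM t WHERE amount > 10
big_spenders = [row["name"] for row in rows if row["amount"] > 10]
```

The stored file is never rewritten; changing the schema only changes how later reads interpret it.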
https://strimzi.io/
https://strimzi.io/documentation/
Strimzi provides a way to run an Apache Kafka cluster on Kubernetes in various deployment configurations
https://aws.amazon.com/pt/emr
Easily run and scale Apache Spark, Hive, Presto, and other big data frameworks
Amazon EMR is the industry-leading cloud big data platform for processing large amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto.
https://aws.amazon.com/pt/glue/
Simple, scalable, and serverless data integration
AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. It provides all the capabilities needed for data integration, so you can start analyzing and using your data in minutes instead of months.
https://aws.amazon.com/pt/Quicksight/
QuickSight lets you easily create and publish interactive dashboards that include Machine Learning Insights.
https://docs.aws.amazon.com/pt_br/athena/index.html
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless: there is no infrastructure to set up or manage, and you pay only for the queries you run. To get started, point to your data in S3, define the schema, and start querying with standard SQL.
https://aws.amazon.com/pt/s3/
Object storage built to store and retrieve any amount of data from anywhere
Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance.
https://aws.amazon.com/pt/rds/
Set up, operate, and scale a relational database in the cloud with just a few clicks. Amazon Relational Database Service (Amazon RDS) makes it easy to set up, operate, and scale relational databases in the cloud.
https://hadoop.apache.org/
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
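The best-known of those "simple programming models" is MapReduce. The sketch below runs the three classic phases (map, shuffle, reduce) in a single Python process to count words. It is only an illustration of the model, not Hadoop code: on a real cluster each phase would run in parallel across machines, and the function names here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data tools", "data tools for big data"])
```

Because each map call sees only one line and each reduce call sees only one key, both phases can be split across many machines without the functions themselves changing.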
https://ksqldb.io/
The database purpose-built for stream processing applications
Seamlessly leverage your existing Apache Kafka® infrastructure to deploy stream-processing workloads and bring powerful new capabilities to your applications.
https://kubernetes.io/pt-br/
Production-grade container orchestration
Kubernetes (K8s) is an open source system for automating the deployment, scaling, and management of containerized applications.
It groups the containers that make up an application into logical units for easier management and service discovery. Kubernetes builds on 15 years of experience running containers in production at Google, combined with the best ideas and practices from the community.
https://kubernetes.io/docs/reference/kubectl/overview/
The kubectl command line tool lets you control Kubernetes clusters
https://ahmet.im/blog/kubectx/
kubectx: a tool to switch between Kubernetes contexts
https://aws.amazon.com/pt/eks/
Amazon Elastic Kubernetes Service
Amazon Elastic Kubernetes Service (Amazon EKS) is a fully managed Kubernetes service.
https://www.python.org/
Python is a programming language that lets you work quickly and integrate systems more effectively
https://git-scm.com/
Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency
https://www.postgresql.org/
PostgreSQL is a powerful, open source object-relational database system with over 30 years of active development that has earned it a strong reputation for reliability, feature robustness, and performance.
https://confluent.io
You love Apache Kafka, but not managing it. Our fully managed service means your best people can now focus on delivering value to your customers.
https://www.docker.com
A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings.
https://www.elastic.co/pt/what-is/elasticsearch
Elasticsearch is a distributed, free and open search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured. Elasticsearch is built on Apache Lucene and was first released in 2010 by Elasticsearch N.V.
https://towardsdatascience.com/big-data-file-formats-explained-dfaabe9e8b33
https://www.confluent.io/blog/publishing-apache-kafka-new-york-times
https://kubernetes.io/pt-br/docs/reference/kubectl/cheatsheet/
https://towardsdatascience.com/kafka-python-explained-in-10-lines-of-code-800e3e07dad1
https://www.unraveldata.com/resources/catalyst-analyst-a-deep-dive-into-sparks-optimizer/