- A very normal Data Engineering work 🎉
- What can go wrong in Distributed Data Systems
- The Berkeley View on Cloud Computing - Paper
- Dolt is Git for Data 🎊
- Everything Around PySpark Pandas UDF 📖
- Architect and build a #machinelearning use case end to end using Amazon SageMaker 🎉
- Around Data Discovery or Metadata Management Platforms
- Amazon S3 Object Lambda - Provide Different Views of Data to Multiple Applications
- The Google File System - The Paper 🎉
- Toward Better Data Culture From First Principles by Ube
- Getting started with #dataengineering Volume 6 🎉
- Getting started with Data Engineering, volume 5 🎉
- Getting started with Data Engineering, volume 4 🎉💡
- Getting started with Data Engineering, volume 3 🎉💡
- Getting started with Data Engineering, volume 2 🎉💡
- Getting started with Data Engineering, volume 1 🎉💡
- Apache Airflow 2.0
- Some Interesting essentials while learning Apache Airflow
- Dagster Release 0.10.0 - Everything about Exactly-once, Fault-Tolerant Scheduling - Extremely Important Release 🎉🎉🎉
- #getdbt or Data Build Tool interface across all major Data Workflow Management Platforms 💯✨🔥
- Apache Superset - An #opensource Fully Featured Business Intelligence Application 🎊🎊🎊
- The Hop Orchestration Platform, or Apache #Hop (Incubating), aims to facilitate all aspects of data and metadata orchestration 💯💡⭐
- Apache Iceberg Partitioning is way better than Hive! Hidden Partitioning makes everything easier! 🎉
- Trino aka #prestosql is different from Apache Spark SQL - Exclusively designed for Distributed SQL 🎉
- Apache Spark is NOT a Map Reduce but an MPP/MPI Engine
- Apache Hudi - Design Principles
- OpenTelemetry specification V1.0
- DataEngg Skills to work with DataScience
- Data Quality, A necessity for Data Driven Projects
- Essential Cloud Skills for Data Engineering
- Open Source Technologies in Data Engineering
- Kubernetes Fundamentals Required as a Data Engineer
- Apache Superset, OSS Business Intelligence for 2021
- #apachekafka as a Database - Summary of both sides, Arguments, Trade-offs & exceptional 💬 quotes ⏳💡⏳
- Processing Guarantees in #apachekafka 💯🔆🎉 - The best resource
- Change Data Analysis with Debezium and Apache Pinot 🎉💡🚿
- Optimizing Apache Kafka Producers & Consumers 🎊📈🎉
- Redpanda - A non-JVM Streaming Platform for mission-critical workloads 💡🎉🔆
- Apache Hudi - Turn Batch Jobs to Incremental Model | Complete file management on a Data Lake
- Apache Iceberg - an open table format for huge analytic datasets
- Ballista - Distributed computing platform built primarily on Rust and powered by Apache Arrow
- ZooKeeper, a distributed, open-source coordination service for distributed applications
- Apache Iceberg - Partition Evolution, it's simple but it's so amazing
- A Data Engineering Story - The Beginning
- Data Engineering - More towards Data Science or Data Analytics or ...
- Data Engineering Interview Patterns
- Basic Checklists while learning Apache Spark
- #apachespark for Distributed Analytics or #businessintelligence Platform - Worth it or not?
- Apache Beam for Search: An Introduction & Addressing the challenge of the Time Problem 🔐💡🔒
- Nextflow is a Workflow Manager exclusively for #bioinformatics 🩹💊🩹
- #apachespark Project Zen Update - Making PySpark Better 💡🔗💡
- Design - Exactly Once Delivery & Transactional Messaging in #apachekafka 🎊📋🎊
- An underrated but important skill of a Data Engineer
- Fallacies of Distributed Systems
- As a Data Engineer, some Essentials I did which really helped Data Scientists and the Team
- SQL Database on Kubernetes - Best Practices
- Devtron - An Open Source DevOps on Kubernetes, written in Go 🥇🎁🎉
- Most Popular #opensource BI & Data Analytics Platforms 🎊💡🎉
- datapipelines DataFrame API is now available with #apachebeam 💯🔥💯
- Disaster Recovery for Multi-Region Apache Kafka & Data Consumption using #apacheflink 🔅🎉🔅
- Kubernetes API Structure 💯✔️💯
- Architecting a Kubernetes Infrastructure 💯
- Exploring Kubernetes Operator Pattern 💡
- Docker is an integral part of the Data Engineering ML pipeline & that makes security 🔐 extremely essential
- Rack awareness for #apachekafka Streams Proposal 🎉
- Machine Learning Workflow 💯
- Dummy Notes On Machine Learning Infrastructure
- Machine Learning Feature Store 💯
- Deploying #machinelearning model in Production is really HARD but #MLOps can fix that.
- List of #machinelearning & #dataengineering Technologies I will be following in 2021 🎉💡🎉
- MLOps - ZenML #machinelearning with reproducible pipelines ✅💯✅
- Streamlit Healthcare Machine Learning Data App
- Dstack AI - An open-source tool to develop data applications with Python 🎉💭🎊
- Adversarial Robustness Toolbox - a Python library for #machinelearning Security 💡🔎🔓
- Biopython is a set of freely available tools for biological computation written in #Python 💊⌛️💊
- Time to Know More about DASK
- DataEngineering vs Machine Learning
- A good #machinelearning Model is only possible with a good quality of #data. ⌛️
- Statistics for #softwareengineer 🔥💯🔥
- Monitoring #machinelearning Applications 🎉🎁 🎉
- Dagster is a data orchestrator for machine learning, analytics, and ETL - Officially #machinelearning driven 🥇🥇🥇
- Short Notes on Open Source #machinelearning Tracking System
- The best example of Randomness is - #machinelearning model in Production. 🔐💭🔎
- Flyte is a declarative, structured, and highly scalable cloud-native workflow orchestration platform for Distributed Machine Learning
- The Snowflake Paper - Core idea is to build an enterprise-ready #datawarehouse solution for the #cloud 🎉📰📕
- Most important points around Distributed #dataengineering Platform
- Fundamentals of #distributedsystems Scaling - Avoiding Coordination 🎊♨️🔆
- Technical Debt in #dataengineering #softwareengineering 🔕💡🔕
- Paper on Wander Join: Online Aggregation via Random Walks 📃💭📑 Join problem
- The Delta Lake Paper - High-Performance ACID Table Storage 📋💡📋
- Dynamo - Amazon's Highly Available Key-value Store #distributedsystem 💬💡🎉
- An Efficient and Syntactically Idiomatic Approach to Management of Streams and Tables, A Single SQL for all 💡📩📩
- Secure & Robust Machine Learning in #healthcare 💊🧪🥳
- Progress in Medical Science using #deeplearning 💊💡💉
- The Amazon Redshift Paper - A fast, fully managed, petabyte-scale data warehouse solution that makes it simple and cost-effective to efficiently analyze large volumes of data using existing #businessintelligence tools 📂📰💭
- Advancing #drugdiscovery via Artificial Intelligence 💊🏥🏥
- Apache Calcite is a dynamic data management framework 🎉📚🎉
- Lakehouse - A Paper on a new Generation of #datawarehouse technology 💡🔎💡
- Calvin: Fast Distributed Transactions for Partitioned Database Systems 📝📝
- Presto or Trino - #SQL on Everything (The Design, Motivation & Performance) #presto 💭🎊💡
- Design - Exactly Once Delivery & Transactional Messaging in Apache Kafka
- Apache Kafka Paper : Distributed Messaging System for Log Processing
- Paper: Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size
- Paper: Ground is an open-source data context service, a system to manage all the information that informs the use of data
- Azure Data Lake Store (ADLS) is a fully-managed, elastic, scalable, and secure file system that supports #hadoop distributed file system (HDFS) and Cosmos semantics
- An LFU (Least Frequently Used) Cache eviction algorithm with O(1) runtime complexity