Focus: Advanced Python concepts, data structures, algorithms, and coding exercises.
- Topics to Cover:
- Advanced Python:
- List, dict, and set comprehensions.
- Decorators, context managers (with statements).
- Generators and iterators.
- OOP Concepts:
- Abstract classes and interfaces.
- Polymorphism and inheritance.
- Design patterns: Singleton, Factory, etc.
- Async Programming:
- `asyncio`, coroutines, `async`/`await`.
- Error Handling and Logging:
- Custom exceptions.
- Logging best practices.
- Python for Data Engineering:
- Working with large files (`csv`, `json`, `parquet`).
- Serialization/deserialization (`pickle`, JSON, etc.).
- Performance optimization (`multiprocessing`, `concurrent.futures`).
- Actionable Exercises:
- Implement a custom decorator for memoization (a sketch follows this list).
- Write a context manager to handle file operations with error handling (also sketched below).
- Develop an async web scraper using `aiohttp` or `httpx`.
- Solve algorithm problems on LeetCode (Medium/Hard) related to:
- Arrays (e.g., sliding window problems).
- Hashmaps and strings (e.g., anagrams, longest substrings).
- Trees and graphs (e.g., BFS, DFS).
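A minimal sketch of the memoization exercise, using `functools.wraps` to preserve the wrapped function's metadata (the `fib` example is illustrative; in practice `functools.lru_cache` covers the same ground):

```python
import functools

def memoize(func):
    """Cache results keyed by the call's positional and keyword arguments."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        key = (args, tuple(sorted(kwargs.items())))
        if key not in cache:
            cache[key] = func(*args, **kwargs)
        return cache[key]

    return wrapper

@memoize
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(35))  # returns instantly once intermediate results are cached
```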
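And one way the file-handling context manager could look, written as a class so the error handling is explicit (the logger name and the choice to suppress only `OSError` are assumptions):

```python
import logging

logger = logging.getLogger("pipeline")  # logger name is illustrative

class ManagedFile:
    """Open a file, log and suppress I/O errors, and always close the handle."""

    def __init__(self, path, mode="r"):
        self.path = path
        self.mode = mode
        self.handle = None

    def __enter__(self):
        self.handle = open(self.path, self.mode)
        return self.handle

    def __exit__(self, exc_type, exc, tb):
        if self.handle is not None:
            self.handle.close()
        if exc_type is not None and issubclass(exc_type, OSError):
            logger.error("I/O error on %s: %s", self.path, exc)
            return True  # suppress only I/O errors; anything else propagates
        return False

with ManagedFile("data.txt", "w") as f:
    f.write("hello")
```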
Focus: Writing efficient, optimized queries, working with complex datasets, and core AWS services for data pipelines.
- Topics to Cover:
- SQL Essentials:
- Joins (INNER, OUTER, CROSS).
- Aggregations (GROUP BY, HAVING).
- Subqueries and Common Table Expressions (CTEs).
- Performance Optimization:
- Indexing strategies.
- Query execution plans (EXPLAIN).
- Window functions (ROW_NUMBER, RANK, LAG/LEAD).
- Advanced Topics:
- Recursive queries.
- Partitioning tables.
- Actionable Exercises:
- Given two large datasets (e.g., `users` and `transactions`), write queries to:
- Find users with the highest transaction volume in the last 3 months.
- Calculate moving averages of transaction amounts (a window-function sketch follows this list).
- Generate a report showing top-performing users grouped by region.
- Use online SQL platforms like HackerRank or LeetCode SQL for challenges.
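A sketch of the moving-average exercise. It runs standalone against SQLite (3.25+ for window functions); the `transactions` table, its columns, and the 3-row frame are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (user_id INT, tx_month TEXT, amount REAL);
INSERT INTO transactions VALUES
  (1, '2024-01', 100), (1, '2024-02', 200),
  (1, '2024-03', 150), (1, '2024-04', 300);
""")

# Rolling 3-month average per user: the current row plus the two before it.
query = """
SELECT user_id,
       tx_month,
       AVG(amount) OVER (
         PARTITION BY user_id
         ORDER BY tx_month
         ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
       ) AS rolling_avg
FROM transactions;
"""
for row in conn.execute(query):
    print(row)
```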
Focus: Core AWS services and their integration for data pipelines.
- Topics to Cover:
- Compute:
- AWS Lambda, EC2.
- Storage:
- S3, Glacier, DynamoDB.
- Database:
- RDS, Redshift.
- Workflow Orchestration:
- AWS Step Functions.
- Managed Airflow.
- Infrastructure as Code:
- Basics of CloudFormation or Terraform.
- Actionable Exercises:
- Create an AWS Lambda function to process a dataset from S3 and store results in DynamoDB (a boto3 sketch follows this list).
- Design an S3-to-Redshift ETL pipeline and write a Python script to load data.
- Explore AWS Free Tier services and practice basic setups for S3, Lambda, and RDS.
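A minimal sketch of the Lambda exercise using boto3. The bucket and key come from the S3 trigger event; the `results` table name and the assumption that the file is a JSON array are illustrative:

```python
import json
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("results")  # table name is an assumption

def handler(event, context):
    """Triggered by an S3 put event: read a JSON file and store each record."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    for item in json.loads(body):   # assumes the file holds a JSON array
        table.put_item(Item=item)   # each element must include the table's key attributes

    return {"processed": key}
```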
Focus: Using Pandas, Databricks, and Spark for ETL and data manipulation.
- Topics to Cover:
- Pandas:
- Efficient DataFrame manipulations (apply, merge, groupby).
- Handling missing data (fillna, interpolate).
- Working with large datasets (chunksize, dask for scaling Pandas).
- Databricks and Apache Spark:
- Introduction to Spark's architecture (RDD, DataFrame, Dataset API).
- Writing PySpark jobs for ETL pipelines.
- Optimizing queries with caching and partitioning.
- ETL Patterns:
- Data cleaning, transformation, and validation.
- Designing idempotent ETL pipelines.
- Actionable Exercises:
- Pandas:
- Load a large dataset (e.g., 10 million rows) and perform the following (a Pandas sketch follows these exercises):
- Cleaning (e.g., removing duplicates, imputing missing values).
- Aggregation (e.g., sales by month).
- Visualization (e.g., plotting trends using matplotlib or seaborn).
- Databricks:
- Set up a free Databricks workspace and:
- Process a CSV file into Parquet format.
- Write and execute a Spark SQL query.
- ETL Pipeline:
- Build a pipeline that:
- Reads data from an S3 bucket.
- Cleanses and transforms the data with Pandas or Spark.
- Writes the results to a Redshift table.
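A sketch of the Pandas exercise above, reading in chunks so memory stays bounded; the file name and column names are assumptions:

```python
import pandas as pd

# chunksize yields DataFrames of bounded size instead of one huge frame
chunks = pd.read_csv("sales.csv", parse_dates=["date"], chunksize=1_000_000)
df = pd.concat(chunks).drop_duplicates()          # cleaning: remove duplicates

df["amount"] = df["amount"].fillna(df["amount"].median())  # impute missing values

monthly = df.groupby(df["date"].dt.to_period("M"))["amount"].sum()  # sales by month
print(monthly.head())

monthly.plot()  # quick trend line; requires matplotlib
```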
Mock Interviews and Review
- Python Live Coding Practice:
- Solve live problems from platforms like CoderPad or Pramp.
- Write an end-to-end Python script for an ETL pipeline.
- SQL:
- Conduct mock interviews focused on SQL query design and optimization.
- AWS/Cloud:
- Be ready to explain cloud service designs and trade-offs (e.g., why use S3 vs. RDS?).
- Data Engineering Tools:
- Prepare to write Pandas or PySpark transformations live.
- Design Patterns:
- How would you implement a Singleton pattern in Python? When would you use it in real projects?
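One common Python answer is a metaclass that hands back the same instance on every call (class names here are illustrative). A module-level instance often serves the same purpose with less machinery; mentioning that trade-off is part of a strong answer.

```python
class SingletonMeta(type):
    """Metaclass that returns one shared instance per class."""
    _instances = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]

class ConfigStore(metaclass=SingletonMeta):
    def __init__(self):
        self.settings = {}

a, b = ConfigStore(), ConfigStore()
assert a is b  # both names point at the single shared instance
```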
- Memory Management:
- Explain how Python handles memory allocation. What are garbage collection and reference counting?
- Concurrency:
- Write a program to scrape multiple URLs concurrently using asyncio.
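A minimal sketch with `aiohttp` (the URLs are placeholders); `asyncio.gather` schedules all requests concurrently on one event loop:

```python
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        return url, resp.status, len(await resp.text())

async def main(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

urls = ["https://example.com", "https://httpbin.org/get"]  # placeholder URLs
for url, status, size in asyncio.run(main(urls)):
    print(url, status, size)
```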
- Error Handling:
- How do you design a robust error-handling mechanism for a file-processing pipeline?
- Code Optimization:
- Refactor a function with nested loops processing large datasets to improve performance.
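A typical shape this answer takes: replace an O(n·m) nested scan with an O(n+m) hash lookup (the data and names are made up):

```python
orders = [(1, 50), (2, 75), (3, 20)]   # (customer_id, amount)
vip_ids = [2, 3]

# Before: O(n * m), the inner loop rescans vip_ids for every order
slow = [amt for cid, amt in orders for v in vip_ids if cid == v]

# After: O(n + m), a set gives constant-time membership tests
vip_set = set(vip_ids)
fast = [amt for cid, amt in orders if cid in vip_set]

assert slow == fast
```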
- Testing:
- How do you write unit tests for a function that interacts with a third-party API?
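The usual approach is to mock the network boundary with `unittest.mock` so the test never hits the real API. A sketch, where the endpoint URL and payload are assumptions:

```python
from unittest import mock

import requests  # the third-party client wrapped by the code under test

def get_user_name(user_id):
    resp = requests.get(f"https://api.example.com/users/{user_id}")  # URL is illustrative
    resp.raise_for_status()
    return resp.json()["name"]

def test_get_user_name():
    fake = mock.Mock()
    fake.json.return_value = {"name": "Ada"}
    # Patch the call site so no HTTP request is made
    with mock.patch("requests.get", return_value=fake) as patched:
        assert get_user_name(42) == "Ada"
        patched.assert_called_once()
```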
- OOP:
- Implement a class-based solution for a parking lot management system.
- Generators:
- How would you use a generator to handle streaming data efficiently in Python?
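A small sketch of the idea: each stage yields one record at a time, so only the current record is ever in memory (the file name and CSV layout are assumptions):

```python
def read_records(path):
    """Yield one parsed record at a time instead of materializing the file."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n").split(",")

def running_total(records, column=1):
    """Consume a record stream lazily and emit a running sum."""
    total = 0.0
    for rec in records:
        total += float(rec[column])
        yield total

# Stages compose without buffering:
# for t in running_total(read_records("events.csv")):  # file name is an assumption
#     print(t)
```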
- Large Files:
- How would you read and process a 10GB CSV file in Python without loading it all into memory?
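The core of the answer is streaming: iterate the file rather than calling anything that loads it whole. A sketch, with the column name as an assumption:

```python
import csv

def process_large_csv(path):
    """Stream a huge CSV row by row; memory use stays flat regardless of size."""
    total = 0.0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += float(row["amount"])  # column name is an assumption
    return total

# The same idea in Pandas: iterate fixed-size chunks instead of rows.
# for chunk in pd.read_csv(path, chunksize=500_000):
#     total += chunk["amount"].sum()
```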
- Database Integration:
- Write a Python script that connects to a PostgreSQL database, fetches data, and performs transformations.
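A sketch using `psycopg2` (connection parameters, table, and the cents-to-dollars transformation are all placeholders):

```python
import psycopg2

# Connection parameters are placeholders.
conn = psycopg2.connect(host="localhost", dbname="shop", user="etl", password="secret")
try:
    with conn.cursor() as cur:
        cur.execute("SELECT user_id, amount FROM transactions")
        rows = cur.fetchall()

    # Transformation: convert cents to dollars and total per user.
    totals = {}
    for user_id, amount in rows:
        totals[user_id] = totals.get(user_id, 0) + amount / 100
    print(totals)
finally:
    conn.close()
```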
- Query Optimization:
- Analyze a poorly written query and optimize it to reduce execution time.
- Joins:
- Write a query to find customers who made purchases in two consecutive months.
- Window Functions:
- Calculate a 3-month rolling average for sales data using a window function.
- CTEs and Recursive Queries:
- Use a recursive CTE to find all employees under a manager in an organization hierarchy.
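A self-contained sketch of the recursive CTE, run against SQLite so it executes as-is; the `employees` schema and sample hierarchy are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INT, name TEXT, manager_id INT);
INSERT INTO employees VALUES
  (1, 'Ada', NULL), (2, 'Bob', 1), (3, 'Cid', 2), (4, 'Dee', 1);
""")

# Walk down the hierarchy starting from manager id 1.
query = """
WITH RECURSIVE reports AS (
  SELECT id, name, manager_id FROM employees WHERE manager_id = 1
  UNION ALL
  SELECT e.id, e.name, e.manager_id
  FROM employees e JOIN reports r ON e.manager_id = r.id
)
SELECT * FROM reports;
"""
print(conn.execute(query).fetchall())  # Bob, Dee, Cid (order may vary)
```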
- Indexes:
- Explain how indexes work and why they can sometimes slow down performance.
- Data Aggregation:
- Write a query to group sales data by region and month, including the total and average sales.
- Schema Design:
- Design a schema for a library system. Explain your normalization decisions.
- Data Cleaning:
- How would you identify and remove duplicate entries in a database table?
- Transaction Management:
- Explain how you would ensure data consistency in a banking system with concurrent transactions.
- Error Debugging:
- You receive an error in a SQL query that uses a GROUP BY clause. How do you troubleshoot it?
- S3 Bucket Policies:
- How do you restrict public access to an S3 bucket while allowing specific users or services?
- Lambda Functions:
- Design an AWS Lambda function to process a stream of events from Kinesis and store results in DynamoDB.
- EC2 Scaling:
- How do you set up auto-scaling for an EC2 instance in response to traffic spikes?
- Serverless Architecture:
- When would you choose serverless over traditional EC2-based deployments? Provide real-world examples.
- CloudFormation:
- Write a basic CloudFormation template to provision an S3 bucket and a DynamoDB table.
- Networking:
- How do you set up a VPC with a public and private subnet? Why is this architecture useful?
- Data Transfer:
- What are the options to transfer 1TB of data from an on-premise server to S3?
- AWS IAM:
- How would you manage permissions for multiple users in an organization?
- Monitoring:
- Explain how you would use CloudWatch to monitor the performance of an application hosted on AWS.
- AWS Glue:
- How would you use AWS Glue to create an ETL pipeline for transforming data from S3 to Redshift?
- Pandas Performance:
- How would you optimize a Pandas DataFrame operation to process a dataset with 10 million rows?
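The first move is usually replacing row-wise `apply` with vectorized column arithmetic, then shrinking dtypes. A sketch on synthetic data:

```python
import numpy as np
import pandas as pd

n = 10_000_000
df = pd.DataFrame({"price": np.random.rand(n),
                   "qty": np.random.randint(1, 10, n)})

# Slow: apply(axis=1) invokes Python code once per row.
# df["total"] = df.apply(lambda r: r["price"] * r["qty"], axis=1)

# Fast: vectorized arithmetic runs in C over whole columns.
df["total"] = df["price"] * df["qty"]

# Downcasting dtypes is another common win for wide numeric frames.
df["qty"] = df["qty"].astype("int8")
```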
- Data Validation:
- Write a Pandas pipeline to clean, validate, and enrich raw customer data.
- PySpark Basics:
- Explain the differences between Spark RDDs and DataFrames. When would you use one over the other?
- Databricks:
- How do you manage a large ETL job in Databricks, ensuring scalability and fault tolerance?
- Partitioning in Spark:
- How does partitioning improve Spark job performance? Write an example.
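One way to answer the "write an example" part, as a PySpark sketch (paths, partition count, and column names are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()

df = spark.read.parquet("s3://bucket/events/")  # path is a placeholder

# Hash-partition by the aggregation key so the subsequent groupBy on the
# same key does not need a second shuffle.
by_region = df.repartition(16, "region")
totals = by_region.groupBy("region").sum("amount")

# Partitioned output lets later readers prune directories by region.
totals.write.mode("overwrite").partitionBy("region").parquet("s3://bucket/agg/")
```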
- Distributed Computing:
- Explain how Spark handles fault tolerance in a distributed system.
- Data Lake:
- What are the challenges of designing a data lake? How do tools like Databricks address them?
- ETL Pipelines:
- Design an ETL pipeline to ingest JSON data from an API, transform it, and store it in a SQL database.
- Data Serialization:
- Discuss the pros and cons of using Avro, Parquet, and JSON for storing large datasets.
- Real-Time Data:
- Write a PySpark job to process streaming data from Kafka and store aggregated results in HDFS.
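A Structured Streaming sketch of that last prompt; the broker address, topic, window sizes, and HDFS paths are all placeholders. The watermark is what makes append mode (required by file sinks) legal for a windowed aggregation:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-agg").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "events")                     # placeholder topic
    .load()
)

# Count messages in 5-minute event-time windows, tolerating 10 minutes of lateness.
counts = (
    events.selectExpr("CAST(value AS STRING) AS value", "timestamp")
    .withWatermark("timestamp", "10 minutes")
    .groupBy(F.window("timestamp", "5 minutes"))
    .count()
)

query = (
    counts.writeStream.outputMode("append")  # file sinks require append mode
    .format("parquet")
    .option("path", "hdfs:///agg/events")
    .option("checkpointLocation", "hdfs:///checkpoints/events")
    .start()
)
query.awaitTermination()
```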