Tasks and Duties
Week 1 Task: Data Lake Architecture Design:
For the first week, you are tasked with designing a robust, scalable, and efficient Data Lake architecture on a platform of your choice (AWS, Azure, GCP, etc.). Your design should cover data ingestion, storage, processing, and security. The expected deliverable is a DOC file containing a detailed design of the architecture, the reasoning behind each service choice, and an explanation of how data flows through the system. Evaluation will be based on the robustness, scalability, and cost efficiency of the design, and on the clarity and depth of your explanations.
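As one illustration of how data flow can be made concrete in the design document, the sketch below builds object keys for a zone-based layout with date partitions. The zone names ("raw", "curated"), the `source=`/`year=`/`month=`/`day=` key convention, and the function name `object_key` are assumptions chosen for illustration; they are not required by this task or by any particular cloud provider.

```python
from datetime import date
from pathlib import PurePosixPath

# Hypothetical zones: "raw" holds data as ingested, "curated" holds cleaned data.
ZONES = ("raw", "curated")


def object_key(zone, source, day, filename):
    """Build a partitioned object-store key for one file in a given zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return str(
        PurePosixPath(
            zone,
            f"source={source}",
            f"year={day:%Y}",
            f"month={day:%m}",
            f"day={day:%d}",
            filename,
        )
    )


if __name__ == "__main__":
    print(object_key("raw", "sales", date(2024, 1, 5), "orders.json"))
```

Partitioning keys by source and date is one common way to let processing jobs read only the slices they need; whether it suits your design depends on the query patterns you expect.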
Week 2 Task: ETL Pipeline Development:
This week focuses on the development of an ETL (Extract, Transform, Load) pipeline. Choose a publicly available dataset and design an ETL process for it. Document the steps of extracting the data, transforming it to suit the target database, and loading it into that database. Submit a DOC file documenting the entire process, the challenges you faced and how you mitigated them, and a brief explanation of the code. Evaluation will be based on how well the ETL process is designed, the efficiency of the pipeline, and the clarity of your documentation.
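The three ETL stages described above can be sketched as follows, using only the Python standard library. This is a minimal sketch under stated assumptions: the inline `RAW_CSV` sample stands in for a real public dataset, and the column names, the `readings` table, and the Fahrenheit-to-Celsius transformation are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for a downloaded public dataset.
RAW_CSV = """city,date,temp_f
Oslo,2024-01-01,28.4
Lima,2024-01-01,75.2
Oslo,2024-01-02,
"""


def extract(text):
    """Extract: parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: skip rows with no reading, convert Fahrenheit to Celsius."""
    records = []
    for row in rows:
        if not row["temp_f"]:
            continue
        temp_c = round((float(row["temp_f"]) - 32) * 5 / 9, 1)
        records.append((row["city"], row["date"], temp_c))
    return records


def load(records, conn):
    """Load: create the target table if needed and insert the records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (city TEXT, date TEXT, temp_c REAL)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline(text, conn):
    """Run extract, transform, and load in order; return rows loaded."""
    records = transform(extract(text))
    load(records, conn)
    return len(records)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(run_pipeline(RAW_CSV, connection), "rows loaded")
```

Keeping each stage as a separate function makes it easier to document and test the stages individually, which is what the deliverable asks you to explain.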
Week 3 Task: Data Quality Assurance:
This week, you will work on data quality assurance. Choose a publicly available dataset, clean and validate it, and check it for duplicates, missing values, and inconsistencies. Deliver a DOC file that outlines the data quality assurance steps, your findings, and the data cleaning process you implemented. Evaluation will be based on the completeness and accuracy of the data quality assurance steps and on the effectiveness of the data cleaning process.
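The checks named above (duplicates, missing values, inconsistencies) might look like the sketch below. It assumes the dataset has already been read into a list of dicts; the function names `quality_report` and `normalize_country` and the sample field names are hypothetical.

```python
from collections import Counter


def quality_report(rows, key_fields, required_fields):
    """Count duplicate keys and missing required values in a list of dicts."""
    key_counts = Counter(tuple(row.get(f) for f in key_fields) for row in rows)
    duplicates = {key: n for key, n in key_counts.items() if n > 1}
    missing = {
        field: sum(1 for row in rows if row.get(field) in (None, ""))
        for field in required_fields
    }
    return {"duplicates": duplicates, "missing": missing}


def normalize_country(value):
    """Collapse case and whitespace variants so spellings compare equal."""
    return value.strip().upper()


if __name__ == "__main__":
    sample = [
        {"id": "1", "country": "us", "email": "a@x.org"},
        {"id": "2", "country": " US ", "email": ""},
        {"id": "1", "country": "us", "email": "a@x.org"},
    ]
    print(quality_report(sample, ["id"], ["email"]))
    print({normalize_country(row["country"]) for row in sample})
```

A report like this records what was found before any cleaning is applied, so the findings and the cleaning steps can be documented separately.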
Week 4 Task: Data Modeling:
This week's task is to perform data modeling. Use a publicly available dataset to create a logical data model and a physical data model. The DOC file deliverable should include both models, a step-by-step explanation of how you created them, and the business questions these models will help answer. Evaluation will be based on the correctness and relevance of the data models and on how effectively they answer the business questions.
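To show how a physical model ties back to a business question, the sketch below defines two related tables in SQLite and answers one question with a query. The entities (`customer`, `customer_order`), their columns, and the example question ("what is the revenue per country?") are assumptions made for illustration, not part of the task.

```python
import sqlite3

# Hypothetical physical model: one customer has many orders.
DDL = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    country     TEXT NOT NULL
);
CREATE TABLE customer_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    order_date  TEXT NOT NULL,
    amount      REAL NOT NULL CHECK (amount >= 0)
);
CREATE INDEX idx_order_customer ON customer_order(customer_id);
"""


def build_schema(conn):
    """Enable foreign-key enforcement and create the tables and index."""
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(DDL)


def revenue_by_country(conn):
    """Answer the example business question: total order amount per country."""
    return conn.execute(
        "SELECT c.country, SUM(o.amount) "
        "FROM customer c JOIN customer_order o ON o.customer_id = c.customer_id "
        "GROUP BY c.country ORDER BY c.country"
    ).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_schema(connection)
    connection.execute("INSERT INTO customer VALUES (1, 'Ann', 'NO')")
    connection.execute("INSERT INTO customer_order VALUES (1, 1, '2024-01-01', 10.0)")
    print(revenue_by_country(connection))
```

The logical model here is the one-to-many relationship between customers and orders; the data types, constraints, and index are the physical choices layered on top of it.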
Week 5 Task: Data Governance Strategy:
For the fifth week, you will develop a data governance strategy. The strategy should cover data ownership, data quality control, data privacy, and data security. The DOC file deliverable should contain the data governance strategy, reasons behind the strategy choices, and how this strategy would be implemented and maintained. Evaluation will be based on the comprehensiveness of the strategy, the feasibility of implementation, and the clarity of your explanations.
Week 6 Task: Performance Tuning of Data Pipelines:
For the final week, you will focus on the performance tuning of data pipelines. Choose a data pipeline and identify performance bottlenecks. Develop a strategy to optimize the pipeline and improve its performance. The DOC file deliverable should contain a detailed analysis of the bottlenecks, the optimization strategy, and the expected outcome of the optimization. Evaluation will be based on the accuracy of the bottleneck identification, the effectiveness of the optimization strategy, and the clarity of your analysis and explanations.
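Bottleneck identification usually starts with measurement. The sketch below times each stage of a pipeline and reports the slowest one; it assumes the pipeline can be expressed as an ordered list of (name, function) pairs, and the helper names `profile_stages` and `bottleneck` are hypothetical.

```python
import time


def profile_stages(stages, data):
    """Run each (name, func) stage in order and record its wall-clock time."""
    timings = {}
    for name, func in stages:
        start = time.perf_counter()
        data = func(data)
        timings[name] = time.perf_counter() - start
    return data, timings


def bottleneck(timings):
    """Return the name of the stage that took the longest."""
    return max(timings, key=timings.get)


if __name__ == "__main__":
    def slow_parse(items):
        time.sleep(0.05)  # stands in for an expensive step
        return [int(x) for x in items]

    pipeline = [("parse", slow_parse), ("double", lambda xs: [x * 2 for x in xs])]
    result, stage_times = profile_stages(pipeline, ["1", "2", "3"])
    print(result, "slowest stage:", bottleneck(stage_times))
```

Per-stage timings like these give the analysis a baseline, so the expected outcome of an optimization can be stated against measured numbers rather than estimates.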