Tasks and Duties
Objective: In this task, you will demonstrate your ability to acquire a publicly available dataset, perform data cleaning, and preprocess the data using Python. The goal is to develop a detailed report that documents your process of handling missing values, outliers, and erroneous entries to prepare the dataset for further analysis.
Expected Deliverables: A DOC file that includes a comprehensive report with code snippets, explanatory text, and screenshots if applicable. The report must detail the data collection method, cleaning procedures, and preprocessing steps, along with challenges faced and how they were overcome.
Key Steps:
- Select a relevant public dataset from reliable sources.
- Perform initial data exploration to understand the structure and quality of the data.
- Identify and address missing values, inconsistencies, and outlier issues.
- Document each step with Python code and narrative commentary explaining the rationale behind your approach.
- Discuss potential impacts of preprocessing on subsequent analyses.
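The steps above can be sketched in a few lines of Pandas. This is a minimal illustration only, using a toy DataFrame with made-up column names in place of whichever public dataset you select; it shows median imputation for missing values and the 1.5×IQR rule for outliers, which are common but not mandatory choices:

```python
import numpy as np
import pandas as pd

# Toy data standing in for your chosen public dataset (columns are illustrative).
df = pd.DataFrame({
    "age":    [25, 30, np.nan, 45, 200],          # 200 is an obvious outlier, NaN a gap
    "income": [40_000, 52_000, 61_000, np.nan, 58_000],
})

# 1. Inspect missingness before deciding on a strategy.
missing_counts = df.isna().sum()

# 2. Impute numeric gaps with the column median (robust to outliers).
df = df.fillna(df.median(numeric_only=True))

# 3. Flag outliers with the 1.5 * IQR rule, then clip them to the fences.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["age"] = df["age"].clip(lower, upper)
```

In your report, pair each such step with commentary: why median rather than mean, why clipping rather than dropping rows, and what each choice implies for downstream analysis.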
Evaluation Criteria: Your task will be evaluated based on clarity of documentation, thoroughness in addressing data quality issues, correctness of the Python code, and the overall structure and readability of the final DOC file. The report should be well-organized, thoroughly detailed, and include reflective commentary on the choices made during preprocessing. This task is expected to take approximately 30 to 35 hours of work, allowing enough time to dive deeply into the data cleaning process and to present your findings effectively.
Objective: This week’s task focuses on performing exploratory data analysis (EDA) and visualization on a public dataset. The objective is to extract meaningful insights by using Python libraries such as Pandas, Matplotlib, and Seaborn. The documentation should detail your approach to analyzing the dataset, summarizing key findings, and creating visualizations that highlight trends and patterns.
Expected Deliverables: A DOC file that serves as your final report. It should include a detailed explanation of your EDA process, accompanied by code snippets, visuals, and interpretations of the findings. Ensure that your visualizations are properly annotated with titles, labels, and legends.
Key Steps:
- Choose a suitable public dataset for analysis.
- Perform initial analysis to summarize basic statistics and identify interesting features.
- Create multiple visualizations to uncover patterns, correlations, and anomalies within the data.
- Interpret the visualizations and provide insights into why these patterns may exist.
- Explain any data transformations or aggregations performed during the analysis.
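As a compact sketch of the workflow above, the snippet below uses synthetic data (the column names and the linear relationship are invented for illustration) to show summary statistics, a correlation check, and a properly annotated Matplotlib scatter plot:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs headless
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
# Synthetic stand-in for your chosen public dataset.
df = pd.DataFrame({"hours_studied": rng.uniform(0, 10, 100)})
df["exam_score"] = 50 + 4 * df["hours_studied"] + rng.normal(0, 5, 100)

# Summary statistics and a correlation check come first.
summary = df.describe()
corr = df.corr().loc["hours_studied", "exam_score"]

# An annotated scatter plot: title, axis labels, and a legend.
fig, ax = plt.subplots()
ax.scatter(df["hours_studied"], df["exam_score"], label="students", alpha=0.6)
ax.set_title(f"Score vs. study time (r = {corr:.2f})")
ax.set_xlabel("Hours studied")
ax.set_ylabel("Exam score")
ax.legend()
fig.savefig("eda_scatter.png", dpi=150)
```

Seaborn can replace the plotting calls where a statistical chart type (e.g. `sns.regplot`) communicates the pattern better; either way, every figure in the report needs the title, labels, and legend shown here.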
Evaluation Criteria: Your submission will be evaluated based on the depth of your EDA, the quality and clarity of your visualizations, and your ability to interpret and discuss the results critically. Your DOC file should be structured with clear sections, making it easy to follow your analysis process. The overall quality of your code and its integration into the narrative will also be an important part of the evaluation. Plan for this task to take around 30 to 35 hours.
Objective: This task requires you to apply unsupervised learning techniques, specifically clustering, to a publicly available dataset. The goal is to segment the data into meaningful clusters and provide an in-depth analysis of each cluster’s characteristics. You should use Python libraries such as Scikit-learn to implement clustering algorithms like K-Means or Hierarchical Clustering.
Expected Deliverables: A DOC file that includes detailed sections on your methodology, the results of your clustering analysis, and interpretations of the clusters formed. Include code snippets, figures, and discussion on the rationale behind the chosen clustering techniques.
Key Steps:
- Select a publicly available dataset appropriate for clustering.
- Preprocess the dataset as necessary to enhance the performance of the clustering algorithm.
- Apply at least one clustering algorithm, detailing parameter choices such as the number of clusters.
- Visualize the clusters using appropriate plots and graphs.
- Discuss the characteristics of each cluster and potential business or research implications.

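One hedged sketch of steps 2–3, using synthetic blobs in place of a real dataset: scale the features, fit K-Means over a small range of k, and let the silhouette score guide the choice of cluster count (one reasonable heuristic, not the only one):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic blobs stand in for your chosen public dataset.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# Scale features so no single column dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)

# Fit K-Means for several values of k and score each with the silhouette.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)  # k with the highest silhouette score
```

In the report, justify the chosen k with both the silhouette values and a plot of the clusters, then profile each cluster (e.g. per-cluster feature means) to ground the business or research interpretation.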
Evaluation Criteria: Submissions will be evaluated on the innovative application of clustering techniques, the clarity and accuracy of the visualizations, and the depth of the cluster analysis. Your report must be well-organized with clear explanations, demonstrating a strong grasp of unsupervised learning concepts. Work on this task is expected to take approximately 30 to 35 hours, enabling a comprehensive exploration of clustering methodologies and their applications.
Objective: In this task, you are required to design and implement a supervised learning model using Python. Choose between a regression or classification problem on a publicly available dataset and build a predictive model. Document every step of your process, from data preparation and feature engineering to model training, validation, and evaluation.
Expected Deliverables: A DOC file that presents a complete report on your modeling process. This document should include the rationale for the chosen model, detailed code segments, evaluation metrics, performance analyses, and a discussion on the strengths and limitations of the model.
Key Steps:
- Choose and clearly define a problem that can be addressed via regression or classification.
- Preprocess the data appropriately and conduct necessary feature engineering.
- Implement the chosen model using Python libraries such as Scikit-learn.
- Perform model validation using techniques like cross-validation and evaluate performance based on relevant metrics.
- Discuss possible model improvements and the implications of your findings.
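The steps above can be condensed into a short Scikit-learn sketch. It uses the library's built-in breast cancer dataset as a convenient stand-in (any public classification dataset works) and a logistic regression baseline; treat the model choice as an assumption to revisit, not a recommendation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Pipeline keeps preprocessing inside cross-validation, avoiding leakage.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation on the training set guards against a lucky split.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

model.fit(X_train, y_train)
pred = model.predict(X_test)
test_acc = accuracy_score(y_test, pred)
test_f1 = f1_score(y_test, pred)
```

Report both the cross-validation scores and the held-out test metrics, and choose metrics that fit the problem: accuracy alone misleads on imbalanced classes, where F1 or recall is more informative.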
Evaluation Criteria: Your task will be evaluated based on the coherence and completeness of your report, the effectiveness of your feature engineering, the accuracy of your model, and the quality of your analysis of the results. The final DOC file should be detailed and structured, demonstrating your understanding and practical skills in supervised learning. This task should be completed within an estimated 30 to 35 hours of focused work.
Objective: This week, you will explore the fundamentals of deep learning and apply these concepts to solve a specific problem using Python. The task will involve creating a neural network model using a popular deep learning framework such as TensorFlow or PyTorch on a publicly available dataset. Your goal is to highlight the process of designing, training, and evaluating a deep learning model, along with the challenges and insights encountered.
Expected Deliverables: A DOC file that serves as your comprehensive report outlining the project. It should include sections on the problem statement, architecture design, training process, evaluation metrics, and a critical analysis of the results. Ensure that you incorporate detailed code examples, diagrams of the network architecture, and justification for your design decisions.
Key Steps:
- Select a publicly available dataset that suits a deep learning approach.
- Define the problem clearly, whether it is classification, regression, or another task.
- Design a neural network model and justify your choice of architecture and hyperparameters.
- Train the model and evaluate its performance using appropriate metrics.
- Provide a deep dive into the challenges encountered, such as overfitting or resource constraints, and how you addressed these issues.
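To make the train-and-evaluate loop concrete, here is a framework-agnostic sketch in plain NumPy: a one-hidden-layer network trained by gradient descent on synthetic, linearly separable data. It exists only to expose the mechanics (forward pass, loss, backpropagation, weight update) that TensorFlow or PyTorch automate; your actual submission should use one of those frameworks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary-classification data standing in for your chosen dataset.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

# One hidden layer: 2 -> 8 -> 1, tanh then sigmoid.
W1 = rng.normal(scale=0.5, size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)

def forward(X):
    h = np.tanh(X @ W1 + b1)
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))
    return h, p

losses, lr = [], 0.5
for epoch in range(500):
    h, p = forward(X)
    p_safe = np.clip(p, 1e-9, 1 - 1e-9)  # avoid log(0)
    # Binary cross-entropy loss, averaged over the batch.
    losses.append(float(-np.mean(y * np.log(p_safe) + (1 - y) * np.log(1 - p_safe))))
    # Backpropagate: gradient w.r.t. logits, then through each layer.
    grad_logits = (p - y) / len(X)
    grad_h = (grad_logits @ W2.T) * (1 - h ** 2)   # tanh'(z) = 1 - tanh(z)^2
    W2 -= lr * (h.T @ grad_logits); b2 -= lr * grad_logits.sum(axis=0)
    W1 -= lr * (X.T @ grad_h);      b1 -= lr * grad_h.sum(axis=0)

accuracy = float(np.mean((forward(X)[1] > 0.5) == y))
```

Tracking `losses` per epoch, as above, is also how you diagnose the overfitting and convergence issues the task asks you to discuss: plot training and validation loss curves side by side.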
Evaluation Criteria: You will be evaluated on the clarity and thoroughness of the documentation, the innovation in your network design, the rigor of your model evaluation, and the quality of the insights provided. The final submission should comprehensively cover the deep learning process from start to finish and demonstrate that you have a good understanding of the practical applications of neural networks. Allocate approximately 30 to 35 hours for this task.
Objective: This final task is an integrative capstone project that requires you to combine all the key aspects of Data Science with Python that you have learned throughout the internship. You will undertake a comprehensive project that encompasses data acquisition, preprocessing, exploratory data analysis, model building (both supervised and unsupervised where applicable), and evaluation. The objective is to show that you can design and execute a full data science pipeline independently.
Expected Deliverables: A DOC file that contains a detailed project report. This report should have distinct sections for the problem statement, methodological approach, data collection, data cleaning, exploratory analysis, modeling, evaluation, and insights or recommendations. The report should also include relevant Python code examples, visualizations, and a reflective discussion on the project’s challenges and successes.
Key Steps:
- Identify a problem or opportunity that can be addressed using a data science approach with Python.
- Acquire and preprocess data from publicly available sources.
- Perform exploratory data analysis and identify key features of the dataset.
- Implement one or more modeling techniques (both supervised and unsupervised if applicable) to derive predictive or descriptive insights.
- Evaluate the performance of your models using appropriate metrics and discuss their implications.
- Conclude with recommendations and reflections on what further analysis could be done.
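A skeleton of the full pipeline described above, compressed into one script: it uses Scikit-learn's built-in wine dataset purely as a placeholder for your chosen problem, runs a quick exploratory profile, takes an unsupervised look via K-Means, and cross-validates a supervised model wrapped in a preprocessing pipeline. Every component here is a replaceable assumption:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# --- Acquisition: a small public dataset as a stand-in for your own.
data = load_wine(as_frame=True)
df, y = data.data, data.target

# --- Exploration: quick statistical profile of the features.
profile = df.describe()

# --- Unsupervised view: do natural groupings emerge in the scaled features?
X_scaled = StandardScaler().fit_transform(df)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# --- Supervised model, in a pipeline so preprocessing is re-fit per CV fold.
pipe = make_pipeline(SimpleImputer(strategy="median"),
                     StandardScaler(),
                     RandomForestClassifier(n_estimators=200, random_state=0))
cv_acc = cross_val_score(pipe, df, y, cv=5).mean()
```

Your capstone report should expand each of these four commented stages into its own section, with the rationale, intermediate findings, and visualizations the one-line placeholders here omit.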
Evaluation Criteria: The capstone project will be assessed on the overall coherence of the data pipeline, the thoroughness of each project phase, the clarity of your code and explanations, and your ability to critically evaluate method performance. Your DOC file should clearly communicate your thought process, steps taken, and the final outcomes of your project. This comprehensive task should be completed in approximately 30 to 35 hours and serves as a culmination of your practical experience during the internship.