Tasks and Duties
Objective
This task focuses on the crucial initial stage of the data science workflow: performing Exploratory Data Analysis (EDA) and data cleaning. You are expected to demonstrate how to approach a raw dataset with Python, identifying relevant insights and inconsistencies and applying the cleaning routines needed to prepare the data for further analysis.
Expected Deliverables
- A DOC file report summarizing your EDA findings and data cleaning procedures.
- Python code snippets and explanations embedded in your report.
- Visual aids like charts, histograms, and scatter plots that illustrate data distributions and relationships.
Key Steps to Complete the Task
- Data Selection: Choose a publicly available dataset relevant to any domain of your interest.
- Data Exploration: Analyze the dataset to assess data types, missing values, outliers, and overall data structure using Python libraries (pandas, matplotlib, seaborn, etc.).
- Data Cleaning: Describe and implement techniques to handle missing values, normalize data, and remove outliers, and document your decision-making process (a starting-point sketch follows this list).
- Visualization: Create meaningful visualizations to support your analysis.
- Report Composition: Compile your exploration, analysis steps, code excerpts, visualizations, and interpretations into a comprehensive DOC file.
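As a reference for the exploration and cleaning steps above, here is a minimal pandas sketch. The file name (data.csv) and the price column are hypothetical placeholders; adapt the loading, imputation, and outlier rules to your chosen dataset and justify each choice in your report.

```python
import pandas as pd

# Load the dataset; "data.csv" is a placeholder for your chosen public dataset.
df = pd.read_csv("data.csv")

# --- Exploration ---
print(df.shape)         # number of rows and columns
print(df.dtypes)        # data type of each column
print(df.isna().sum())  # missing values per column
print(df.describe())    # summary statistics for numeric columns

# --- Cleaning (illustrative choices; document your own reasoning) ---
# Drop columns that are more than half empty.
df = df.dropna(axis=1, thresh=int(0.5 * len(df)))

# Fill remaining missing numeric values with each column's median.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Remove rows outside 1.5 * IQR for a numeric column of interest
# ("price" is a hypothetical column name).
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```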
Evaluation Criteria
Your submission will be assessed on the clarity of your analysis, the thoroughness of your explanation of the cleaning process, the quality and relevance of the visualizations produced, and the overall comprehensiveness and professional presentation of your report.
The task is designed to take approximately 30-35 hours and gives you detailed insight into the practical aspects of EDA as a foundational component of the data science workflow, which is essential for any aspiring data scientist.
Objective
This task aims to build your ability to create compelling data visualizations that clearly communicate insights and tell a data-driven story. You will focus on transforming raw data into interactive visual narratives using Python visualization libraries, and then summarizing your approach and findings in a detailed DOC report.
Expected Deliverables
- A DOC file that includes your data visualization journey, analysis narrative and critical insights.
- Annotated Python code demonstrating how you created each visualization.
- Embedded screenshots or saved images of your visualizations accompanied by descriptions.
Key Steps to Complete the Task
- Topic and Data Selection: Select a public dataset and identify a story you want the data to tell.
- Visualization Creation: Utilize Python libraries (such as matplotlib, seaborn, and Plotly) to create at least three different types of visualizations (e.g., line chart, bar chart, heat map); see the sketch after this list for one way to set these up.
- Interpretation: Analyze each visualization to extract key insights and correlations in the data.
- Documentation: Write a detailed report in a DOC file that documents your process, the rationale behind your visualization choices, and the final insights.
- Enhancements: Include suggestions on how the visualizations could be refined or extended further.
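One possible setup for the three chart types is sketched below with matplotlib and seaborn. The file name (sales.csv) and the columns date, region, and revenue are hypothetical; swap in the fields of your own dataset and save each figure so it can be embedded in the DOC report.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# "sales.csv" and the columns "date", "region", "revenue" are placeholders.
df = pd.read_csv("sales.csv", parse_dates=["date"])

# 1. Line chart: revenue over time.
df.groupby("date")["revenue"].sum().plot(kind="line", title="Revenue over time")
plt.savefig("line_revenue.png")
plt.close()

# 2. Bar chart: total revenue per region.
df.groupby("region")["revenue"].sum().plot(kind="bar", title="Revenue by region")
plt.savefig("bar_region.png")
plt.close()

# 3. Heat map: correlations between numeric columns.
sns.heatmap(df.select_dtypes(include="number").corr(), annot=True, cmap="coolwarm")
plt.title("Correlation heat map")
plt.savefig("heatmap_corr.png")
plt.close()
```

Plotly could be substituted for any of these if you want interactive versions; export them as static images for the report.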
Evaluation Criteria
Your work will be evaluated on creativity, clarity of visualizations, depth of analysis, and how effectively the final report communicates your data visualization process and conclusions. Make sure that your report is precise, logically organized, and written in a professional tone.
This assignment is expected to take approximately 30-35 hours, helping you develop critical skills in the art of data storytelling and visualization—a key component in data science communication.
Objective
This task is designed to immerse you in the process of developing a basic machine learning model using Python. You will choose a problem statement, preprocess your data, build and evaluate a simple machine learning model, and document the entire lifecycle in a comprehensive report.
Expected Deliverables
- A DOC file report detailing the machine learning project journey.
- Clear explanations of the data preprocessing steps, model selection, and evaluation metrics.
- Python code snippets embedded within the report.
Key Steps to Complete the Task
- Problem Definition: Define a clear problem statement using a publicly available dataset.
- Data Preprocessing: Clean and prepare your dataset, addressing issues such as missing values, feature scaling, and encoding of categorical variables.
- Model Building: Develop a machine learning model using a relevant Python library (e.g., scikit-learn). Choose an appropriate algorithm for classification or regression.
- Model Evaluation: Use techniques such as cross-validation and calculate key metrics (accuracy, precision, and recall for classification; RMSE for regression) to evaluate performance; a scikit-learn sketch follows this list.
- Documentation: Compile your methodology, code, model performance, and potential improvements into a DOC file. Include visual representations of your results.
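The sketch below shows one way the preprocessing, model building, and evaluation steps could fit together with scikit-learn. It assumes a hypothetical customers.csv file with a binary churned target coded as 0/1; the algorithm (logistic regression) and the metrics are examples, not requirements.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Hypothetical dataset and target column; replace with your own problem.
df = pd.read_csv("customers.csv")
X = df.drop(columns=["churned"])
y = df["churned"]  # assumed binary, coded 0/1

numeric_cols = X.select_dtypes(include="number").columns
categorical_cols = X.select_dtypes(exclude="number").columns

# Preprocessing: impute and scale numeric features, one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
])

model = Pipeline([("prep", preprocess),
                  ("clf", LogisticRegression(max_iter=1000))])

# 5-fold cross-validation with classification metrics.
scores = cross_validate(model, X, y, cv=5, scoring=["accuracy", "precision", "recall"])
for metric in ("test_accuracy", "test_precision", "test_recall"):
    print(metric, round(scores[metric].mean(), 3))
```

For a regression problem, swap the estimator (e.g., LinearRegression) and use scoring="neg_root_mean_squared_error" in place of the classification metrics.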
Evaluation Criteria
Your report will be assessed based on the soundness of your model building process, the clarity of your data preprocessing and evaluation methods, as well as the overall coherence and professional quality of your documented submission. This assignment, scheduled for 30-35 hours of work, is designed to solidify your foundational understanding of machine learning model development within the data science workflow.
Objective
The focus of this task is to design and build an automated data pipeline using Python. The workflow should demonstrate your understanding of automating data ingestion, processing, and output generation to support routine analysis tasks. You will develop and document a step-by-step automated workflow that improves the efficiency of data handling.
Expected Deliverables
- A fully detailed report in a DOC file outlining your automated data pipeline.
- Explanatory Python code that demonstrates the workflow process.
- Diagrams and flowcharts to visualize the pipeline architecture and data flow.
Key Steps to Complete the Task
- Pipeline Planning: Identify a real-world scenario requiring automated data processing. Use a public dataset to simulate the data flow.
- Design: Create a workflow diagram illustrating each step from data ingestion through processing to final results.
- Implementation: Develop Python code to simulate the pipeline. Use libraries such as pandas and, where applicable, SQLAlchemy, together with scheduling approaches such as cron jobs or Airflow-style task orchestration (see the sketch after this list).
- Testing: Test each segment of the pipeline and log errors or potential improvements.
- Reporting: Document every step of your pipeline creation, including design decisions, encountered challenges, and performance evaluations, in a DOC file.
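A minimal skeleton for such a pipeline is sketched below, structured as ingest, transform, and export stages with logging to support testing. The paths and the timestamp/value columns are hypothetical; in a real deployment the entry point could be scheduled with cron or wrapped as an Airflow task.

```python
import logging
from pathlib import Path

import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

RAW_PATH = Path("raw/data.csv")                 # hypothetical input location
OUT_PATH = Path("processed/daily_summary.csv")  # hypothetical output location

def ingest(path: Path) -> pd.DataFrame:
    """Read the raw CSV export that arrives on a schedule."""
    log.info("Ingesting %s", path)
    return pd.read_csv(path, parse_dates=["timestamp"])

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Clean the raw records and aggregate them into a daily summary."""
    df = df.dropna(subset=["value"])
    return (df.set_index("timestamp")
              .resample("D")["value"]
              .agg(["count", "mean", "sum"])
              .reset_index())

def export(df: pd.DataFrame, path: Path) -> None:
    """Write the processed output for downstream consumers."""
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    log.info("Wrote %d rows to %s", len(df), path)

def run_pipeline() -> None:
    export(transform(ingest(RAW_PATH)), OUT_PATH)

if __name__ == "__main__":
    # In production this entry point would be triggered by a scheduler,
    # e.g. a daily crontab entry or an Airflow DAG task.
    run_pipeline()
```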
Evaluation Criteria
The submission will be evaluated on the pipeline's efficiency, clarity, and robustness, along with how insightfully your documentation communicates the process in a structured, professional manner. This task is structured to take approximately 30-35 hours and aims to cultivate your skills in automating data-centric tasks, a critical ability in modern data science roles.
Objective
This task is designed to combine web scraping techniques and text data analysis using Python. You will gather data from publicly accessible websites, preprocess it, and perform sentiment analysis or topic modeling using natural language processing (NLP) techniques. The final goal is to derive actionable insights from unstructured text data and to present your methodology and results in a structured report.
Expected Deliverables
- A DOC file report that includes a narrative of your web scraping process, data preprocessing methods, analysis steps, and final insights.
- Annotated Python code snippets demonstrating your approach.
- Visual representations of text analysis outcomes such as word clouds, sentiment graphs, or topic distributions.
Key Steps to Complete the Task
- Data Collection: Identify a target website or a set of websites from which you can legally scrape data related to a topic of your choice.
- Web Scraping Procedures: Use Python libraries such as Beautiful Soup or Scrapy to extract the desired data, ensuring ethical practices and compliance with each site's terms of use and robots.txt (a scraping-and-analysis sketch follows this list).
- Data Preprocessing: Clean the scraped text by removing irrelevant content and stopwords and by tokenizing it.
- Analysis: Apply text analysis methods such as sentiment analysis using libraries like TextBlob or topic modeling using LDA. Visualize the results effectively.
- Documentation: Compose a DOC file report that thoroughly documents your approach from web scraping through text analysis. This should include both the technical process and the actionable insights derived from the results.
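To illustrate the scraping and sentiment-analysis steps, here is a minimal sketch using requests, Beautiful Soup, and TextBlob. The URL and CSS selector are hypothetical placeholders; confirm that the site you actually target permits scraping before collecting any data.

```python
import requests
from bs4 import BeautifulSoup
from textblob import TextBlob

# Placeholder URL; check robots.txt and the site's terms before scraping a real site.
URL = "https://example.com/reviews"

response = requests.get(URL, headers={"User-Agent": "student-project"}, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
# The CSS selector is hypothetical; inspect the target page to find the right one.
texts = [p.get_text(strip=True) for p in soup.select("div.review p")]

# Simple preprocessing: drop empty and very short fragments.
texts = [t for t in texts if len(t.split()) > 3]

# Sentiment analysis with TextBlob: polarity ranges from -1 (negative) to +1 (positive).
for t in texts:
    polarity = TextBlob(t).sentiment.polarity
    print(f"{polarity:+.2f}  {t[:80]}")
```

The resulting polarity scores can then be aggregated and plotted (for example as a histogram, or alongside a word cloud) to produce the visual deliverables.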
Evaluation Criteria
Your submission will be evaluated on the completeness of your scraping and analysis process, the accuracy and clarity of your Python implementation, and the professionalism of your final report. The task is intended to take about 30-35 hours, reinforcing your skills in obtaining, processing, and analyzing unstructured text data—a valuable capability for any data science professional.