Data Engineer Hiring Project
Build a simple analytics platform for a fake insurance company using the Kaggle dataset Agency Performance Model. The platform has two components: an ETL process (data pipeline) and an API.
Minimum Requirements
Build a Data Pipeline/ETL process that takes the CSVs as input and saves them into a database at a detailed level while also calculating summarized views. These summarized views could follow a star schema or any other design that you think will allow for easy querying using different pivots/dimensions. The Data Pipeline can be triggered manually by running a script (include instructions on how to do it!) or automated in some other way.
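As one possible reading of this requirement, the sketch below loads a CSV into a detail table and rebuilds a summarized table with Pandas. It is only an illustration under stated assumptions: the column names (agency_id, state, product_line, year, written_premium), the table names, and the choice of SQLite are all hypothetical and are not taken from the dataset or from this brief.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical column names; the real dataset's headers will differ.
DIMENSIONS = ["agency_id", "state", "product_line", "year"]
MEASURE = "written_premium"


def run_pipeline(csv_source, conn):
    """Load one CSV into a detail table and rebuild a summary table.

    csv_source may be a file path or any file-like object.
    """
    detail = pd.read_csv(csv_source)

    # Detailed level: store the rows as they arrive.
    detail.to_sql("premium_detail", conn, if_exists="replace", index=False)

    # Summarized view: one row per combination of dimensions.
    summary = (
        detail.groupby(DIMENSIONS, as_index=False)[MEASURE]
        .sum()
        .rename(columns={MEASURE: "total_written_premium"})
    )
    summary.to_sql("premium_summary", conn, if_exists="replace", index=False)
    return summary


if __name__ == "__main__":
    # Tiny in-memory sample standing in for a real CSV file.
    sample = io.StringIO(
        "agency_id,state,product_line,year,written_premium\n"
        "1,OH,CL,2014,100.0\n"
        "1,OH,CL,2014,50.0\n"
        "2,PA,PL,2014,25.0\n"
    )
    with sqlite3.connect(":memory:") as connection:
        print(run_pipeline(sample, connection))
```

Running the file as a script is one way to satisfy the "manually triggered" option; a star schema would split the dimensions into their own tables rather than keeping them inline as this sketch does.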
Build an API (REST or GraphQL) that provides:
Detailed information using different parameters (like agency, month, year, state, etc.)
Summarized information using different parameters (like agency, month, year, state, etc.)
An XLS, XLSX or CSV report with Premium info by Agency and Product Line using date range as parameters
The Data Pipeline/ETL process and also the logic for generating the report must be done using Pandas
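The report logic could be sketched with Pandas roughly as follows. This is a minimal illustration, not a prescribed design: the columns period, agency_id, product_line and written_premium are hypothetical, as are the function names, and the API layer that would call these functions is left out.

```python
import pandas as pd


def premium_report(detail, start, end):
    """Pivot premium by agency (rows) and product line (columns).

    Only rows whose hypothetical "period" date falls within
    [start, end] (inclusive) are included.
    """
    dates = pd.to_datetime(detail["period"])
    in_range = detail[(dates >= pd.Timestamp(start)) & (dates <= pd.Timestamp(end))]
    return in_range.pivot_table(
        index="agency_id",
        columns="product_line",
        values="written_premium",
        aggfunc="sum",
        fill_value=0,
    )


def report_as_csv(detail, start, end):
    """Render the report as CSV text, ready to return from an API endpoint."""
    return premium_report(detail, start, end).to_csv()
```

An endpoint in Django or Flask could pass the date-range query parameters straight to report_as_csv and return the text with a CSV content type; an XLSX variant would write the same pivot with DataFrame.to_excel instead.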
Deployment to AWS
ETL/Data Pipeline Flow
Tech Stack Requirements
The following are requirements on the tech stack. This stack demonstrates mastery of tools our team favors:
Server-Side Development: Python 2.7+ or 3.5+ and Pandas for the API and report
Server Framework: Django, Flask or SimpleHTTPServer
Make sure that your instructions for accessing or otherwise running your code are extremely clear.
Bonus points
We know people may have jobs or other important things to do, leaving them little time to complete our project. The above are the minimum requirements. Any of the following could make you stand out from the crowd by showing your current proficiency with other skills and tools:
Integration or Unit tests (at least one of those). You can use pytest or unittest
Authentication so that only authorized users can query the API
Tests with good test coverage
Documented code that follows PEP 8 and The Zen of Python
API documentation
Using docker for deployment
Using other AWS stack components relevant for Data Engineering (Lambda, S3, DynamoDB, Cognito, etc.)
Using any CI service like Travis, Shippable, Circle CI, etc. for running the tests
Including some predictive analysis like forecasting or categorization as part of the API
Incremental ETL that only processes and loads new records
Build a web app that consumes the API (we use Vue.js and Knockout)
Show some charts, tables, dashboards
Let users run reports with different input parameters and date ranges from there
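For the incremental ETL bonus item above, one possible approach is to compare incoming rows against the keys already stored and append only the unseen ones. The sketch below assumes a hypothetical natural key (agency_id, product_line, year), a hypothetical table name, and SQLite as the database; none of these come from the brief.

```python
import sqlite3

import pandas as pd

# Hypothetical natural key that identifies a record.
KEY_COLUMNS = ["agency_id", "product_line", "year"]


def table_exists(conn, table):
    """Return True if the SQLite table has already been created."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row is not None


def load_new_records(incoming, conn, table="premium_detail"):
    """Append only rows whose key is not yet in the table.

    Returns the number of rows that were loaded.
    """
    if table_exists(conn, table):
        key_list = ", ".join(KEY_COLUMNS)
        existing = pd.read_sql_query(
            f"SELECT DISTINCT {key_list} FROM {table}", conn
        )
        # A left merge with indicator=True marks rows found only in `incoming`.
        merged = incoming.merge(
            existing, on=KEY_COLUMNS, how="left", indicator=True
        )
        new_rows = merged.loc[merged["_merge"] == "left_only", incoming.columns]
    else:
        new_rows = incoming

    new_rows.to_sql(table, conn, if_exists="append", index=False)
    return len(new_rows)
```

This anti-join only detects new keys; it does not pick up changes to rows that were already loaded, which would need a different strategy such as comparing a checksum or a last-modified column.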
Guidelines
We're looking for someone who can work independently and is curious and self-motivated. One major goal of this project is to see how you fill in ambiguities creatively. There is no such thing as a perfect project here, just interpretations of the instructions above, so be creative in your approach.
Deliverables
In order to move your application forward, deliverables will include:
A deployed version of your project, on AWS.
A GitHub repo containing your project. Your repo must contain these two items:
A detailed README that explains your approach and deployment method
Your code solution to this test
Adding these items to your resume's cover letter:
The link to the GitHub repo that lists this project
The link to the deployed version of your project
Uploading your resume with cover letter in PDF or DOCX format by clicking this link.