Dealing with data on a huge scale
Logging events and then storing them in a way that allows fast retrieval later (either for batch processing or user-facing key-value lookups) is non-trivial
Working with a number of off-the-shelf tools including Spark, EMR, DynamoDB, S3, Redshift, and MySQL, while often pushing them past their limits
Keeping this data pipeline running smoothly and improving it is half the job; the other half is wrangling the data for application purposes
Joining the datasets above in novel ways raises practical questions: Is MapReduce or Spark the right tool for the job? How much memory or compute is needed to train a new model on years of impression data? Will the job finish in an hour, a day, or a month? Should it run on one machine or dozens?
These are the types of questions you'll need to deal with, so if you get excited by the idea of ripping through terabytes of data, and making it easy for your teammates to do the same, this job is for you
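As a flavor of the work, the reduce-side join pattern behind the "join the datasets in novel ways" question can be sketched at toy scale in plain Python; the dataset names and fields below are illustrative only, not taken from the actual pipeline:

```python
from collections import defaultdict

# Hypothetical toy stand-ins for impression and click logs;
# real inputs would be key-value records read from S3 or similar.
impressions = [
    ("ad1", "2024-01-01"),
    ("ad2", "2024-01-01"),
    ("ad1", "2024-01-02"),
]
clicks = [
    ("ad1", "2024-01-02"),
]

def reduce_side_join(left, right):
    """Group both inputs by key, then emit matching pairs per key,
    the way a MapReduce reducer would after the shuffle phase."""
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:        # "map" phase tags each record by source
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)
    # "reduce" phase: cross-product of matching values for each key
    return {
        key: [(l, r) for l in lefts for r in rights]
        for key, (lefts, rights) in grouped.items()
        if lefts and rights
    }

joined = reduce_side_join(impressions, clicks)
```

At scale the same pattern runs distributed (Spark's `join` does this shuffle for you), and the memory and runtime questions above come down to how skewed the keys are and how large each grouped partition gets.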
Good verbal and written communication skills in English