The Grug Brained Data Scientist

Advice
Author: Paul Simmering

Published: December 9, 2023

The Grug Brained Developer is a funny essay of advice for software developers. Its lessons resonated with me, so this is my own version, geared towards data professionals.

grug and the demons - made with DALL-E

Introduction

this collection of data science thoughts. good for young grugs that liked The Grug Brained Developer and now want more into data

grug data scientist not understand all but try many thing and fail and learn and do better over time and share what not awful

Complexity bad in data science too

data science much complex. many thing go wrong, invisible to grug

complexity bad, make grug’s brain hurt and cause mistakes that bite grug later

some complexity necessary to solve business problem. that is grug’s job. but grug must not add complexity that not needed

Data quality

data quality most important. if data bad, model bad. if model bad, prediction bad. if prediction bad, business bad, so no shiny rocks for grug

bad data is demon of data sciencing. is sneaky demon that hides in data and makes grug look bad, or worse, give bad info to business shamans. many shiny rocks lost to bad data

grug likes being close to data. but big brain data tools hide data and make it hard for grug to look at tables. grug like to look at tables. grug finds problems in data by looking at tables

grug work in data warehouse for years and when grug smells a stink, grug look at tables and find problem. when grug ignores stink and not look at tables, grug always regret

but projects have many tables and grug busy. so grug must automate look at tables. data quality framework check if data is missing, is in wrong format, or is out of range and if foreign keys are valid
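grug sketch of automated table look, in plain Python with no framework. table names, columns and rules (`orders`, `customers`, amount between 0 and 10000, dates as YYYY-MM-DD) all made up by grug for example. real data quality framework do same idea with more polish

```python
from datetime import datetime

# made-up example tables, as lists of dicts
customers = [{"id": 1}, {"id": 2}]
orders = [
    {"id": 10, "customer_id": 1, "amount": 25.0, "date": "2023-12-01"},
    {"id": 11, "customer_id": 2, "amount": None, "date": "2023-12-02"},  # missing
    {"id": 12, "customer_id": 1, "amount": 30.0, "date": "02/12/2023"},  # wrong format
    {"id": 13, "customer_id": 2, "amount": -5.0, "date": "2023-12-03"},  # out of range
    {"id": 14, "customer_id": 9, "amount": 12.0, "date": "2023-12-04"},  # bad foreign key
]


def is_iso_date(text):
    """True if text parses as a YYYY-MM-DD date."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False


def check_orders(orders, customers, min_amount=0.0, max_amount=10_000.0):
    """Return list of (order id, problem) pairs. Empty list means no stink found."""
    known_customers = {c["id"] for c in customers}
    problems = []
    for row in orders:
        if row["amount"] is None:
            problems.append((row["id"], "amount missing"))
        elif not min_amount <= row["amount"] <= max_amount:
            problems.append((row["id"], "amount out of range"))
        if not is_iso_date(row["date"]):
            problems.append((row["id"], "date in wrong format"))
        if row["customer_id"] not in known_customers:
            problems.append((row["id"], "customer_id has no match"))
    return problems


problems = check_orders(orders, customers)
for order_id, problem in problems:
    print(order_id, problem)
```

grug run check like this after every data load. when list not empty, grug go look at tables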

best guarantee comes from enforced constraints in database. constraints always on guard and never sleep
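small sketch of constraints on guard, using SQLite from Python standard library because it run anywhere. tables and rules made up by grug. other databases have own syntax and own rules about what is enforced, so grug must read manual

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
conn.execute(
    """
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers (id),
        amount REAL NOT NULL CHECK (amount >= 0)
    )
    """
)
conn.execute("INSERT INTO customers (id) VALUES (1)")
conn.execute("INSERT INTO orders VALUES (10, 1, 25.0)")  # good row goes in

bad_rows = [
    (11, 1, None),  # missing amount
    (12, 1, -5.0),  # amount out of range
    (13, 9, 12.0),  # customer 9 does not exist
]
rejected = []
for row in bad_rows:
    try:
        conn.execute("INSERT INTO orders VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError as error:
        rejected.append(row[0])
        print("rejected", row, "because", error)

n_orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print("orders in table:", n_orders)
```

bad rows never get in table, so no check needed later. database do the guarding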

but many analytics databases are too lazy to enforce constraints. some let grug declare keys but never check them. so grug must use data quality framework to check data. grug not like this but best grug can do

Data problem needs data solution

grugs tempted to use complex methods to fix problem of missing data and other stink. but better to fix at source

if data is bad, fix data

say again: if data is bad, fix data

to fix data, grugs need talk other grugs and business shamans. much wait. but must endure and fix data. tempting to patch bad data with code downstream instead. very bad idea, because stink stay at source and come back

Counting things

grug like to count things. when data quality nice, counting things already good enough to make business shamans happy. grug can count anything: users, orders, clicks, shiny rocks collected and more. grug can also separate counting by time, location, and other things
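counting need little code. small sketch with `collections.Counter` from Python standard library. orders, months and countries made up by grug

```python
from collections import Counter

# made-up orders: (month, country)
orders = [
    ("2023-11", "DE"),
    ("2023-11", "FR"),
    ("2023-12", "DE"),
    ("2023-12", "DE"),
    ("2023-12", "FR"),
]

# count by time
orders_per_month = Counter(month for month, _ in orders)

# count by time and location together
orders_per_month_and_country = Counter(orders)

for month, count in sorted(orders_per_month.items()):
    print(month, count)
```

in data warehouse same thing is `COUNT(*)` with `GROUP BY`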

counting easy to do and fit into brain

Visualization

bar chart is grug’s best friend forever. grug can make bar chart of anything. easy for business shamans and grug to understand
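bar chart so simple grug can draw it in terminal with no plotting library. this only sketch. counts and `bar_chart` helper made up by grug. for report to business shamans grug would use real plotting tool

```python
# made-up counts of orders per country
counts = {"DE": 12, "FR": 6, "US": 3}


def bar_chart(counts, width=24):
    """Return text bar chart lines, biggest bar first."""
    biggest = max(counts.values())
    lines = []
    for label, value in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        bar = "#" * round(width * value / biggest)
        lines.append(f"{label:<4}{bar} {value}")
    return lines


lines = bar_chart(counts)
print("\n".join(lines))
```

sorting bars by size make biggest thing jump out first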

complex chart like network graph or tree map or radar chart too hard to understand. message get lost in complexity

pie chart and word cloud look easy but cause misunderstanding. almost always better to use bar chart. sometimes business shamans ask for pie chart, and when pie has few slices, is ok. when pie has many slices, grug must say no

Machine learning

machine learning is powerful tool and unlocker of many shiny rocks. grug understands is not magic and not always best tool for job

big brains use complex machine learning models to solve problem that can be solved with simple model. like to show off big brain

this very bad because big model cost many shiny rocks for train and run. grug can’t look into big model to see what is doing and grug can’t explain big model to business shamans

some hard problem can only be solved with big model. then grug must use big model

grug likes reproducible model training and evaluation. grug and colleagues need to retrain models and compare. easy to forget settings and which data was used. brain limited. better to have tool that logs everything
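minimal sketch of run logging, assuming one JSON line per training run is enough. helper names `data_fingerprint`, `log_run` and `read_runs`, settings and scores all made up by grug. real experiment tracking tool log much more, but idea same: write down settings, data and result every time

```python
import hashlib
import json
import tempfile
from pathlib import Path


def data_fingerprint(rows):
    """Short hash of training data, so grug knows which data was used."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def log_run(log_path, settings, rows, metrics):
    """Append one training run as one JSON line."""
    record = {
        "settings": settings,
        "data_fingerprint": data_fingerprint(rows),
        "metrics": metrics,
    }
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def read_runs(log_path):
    """Read all logged runs back as list of dicts."""
    with open(log_path, encoding="utf-8") as log_file:
        return [json.loads(line) for line in log_file]


# made-up data, settings and scores
rows = [[1.0, 0], [2.0, 1], [3.0, 1]]
log_path = Path(tempfile.mkdtemp()) / "runs.jsonl"
log_run(log_path, {"model": "logistic", "seed": 42}, rows, {"accuracy": 0.81})
log_run(log_path, {"model": "tree", "seed": 42, "max_depth": 3}, rows, {"accuracy": 0.84})

runs = read_runs(log_path)
best = max(runs, key=lambda run: run["metrics"]["accuracy"])
print(best["settings"])
```

same fingerprint on two runs tell grug both models saw same data, so comparison fair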

last few years many big model change grug’s life. grug can now do things that grug could not before. big brains work very fast to make big model better and better. grug very happy about this and grug hope big brains keep doing this

grug prepares for new big model to change. grug knows: model come and go. model is not forever. new model will come and make old model look bad

Performance and productivity

when grug has to wait for model to train or database to query, grug gets bored and grug’s brain wanders. bad for grug’s productivity. make business shamans impatient too

data exploration and model experimentation is more fun when machine goes brrrr rather than when machine goes zzzzz. so when slow, grug uses performance profiling tools to find bottleneck
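one way to find bottleneck in Python is `cProfile` from standard library. sketch below profile a made-up slow step. function names `slow_sum_of_squares` and `pipeline` are grug inventions, not from real project

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Made-up slow step: builds a whole list when a running total would do."""
    squares = []
    for i in range(n):
        squares.append(i * i)
    return sum(squares)


def pipeline():
    return slow_sum_of_squares(200_000)


profiler = cProfile.Profile()
profiler.enable()
result = pipeline()
profiler.disable()

# collect report in a string, sorted by cumulative time, top 5 entries only
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

function at top of report is where time goes. grug fix that one first and leave rest alone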

caching grug’s #2 best friend. grug ask for same thing many times. indexing also good friend
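small caching sketch with `functools.lru_cache` from Python standard library. `orders_in_month` stand in for slow database query. table and numbers made up by grug

```python
from functools import lru_cache

FAKE_TABLE = {"2023-11": 120, "2023-12": 95}  # made-up data
query_log = []  # remembers every time the "slow query" really runs


@lru_cache(maxsize=128)
def orders_in_month(month):
    """Stand-in for a slow query. Result is remembered per month."""
    query_log.append(month)
    return FAKE_TABLE[month]


first = orders_in_month("2023-12")   # runs the query
second = orders_in_month("2023-12")  # answered from cache
other = orders_in_month("2023-11")   # new question, runs the query

print(len(query_log))  # 2: slow query ran twice, not three times
```

cache only safe when answer not change. when table get new data, grug must clear cache with `orders_in_month.cache_clear()`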

cloud development twisted concept. cloud scales in production - nice! but bad for developer experience. write code on laptop, package, upload, and wait for cloud to run. very slow and tiny bug that grug could fix in 1 minute takes long time. grug look for ways to develop locally or with quick feedback loop. setup can be headache but worth it!

Expanding the grug brain

grug’s brain too small and grug too busy to keep up with all new shiny toys. grug must choose which shiny toys to learn

popular data shamans have new toys every day and promise that new toys will solve all problems. grug not always believe this. but some tools are actually good. so grug must choose wisely

learn evergreen skills - always good idea. grug loves SQL because SQL was good for shiny rock collection for decades and will be good for long time more. many new toys use SQL so grug can use SQL with new toys

grug wants to have brain shaped like letter T. grug wants to know basics of many things and aspires to big brain in one thing

always need data quality and visualization and model evaluation. these are basic demon defense skills that every grug must have. cloud also good

to get more shiny rocks, grug must be extra good at one more thing, like model deepthink or huge data organization or business shaman rituals

some grugs identify by their tools. grug is wary of this. grug is grug, not Spark grug or Snowflake grug or AWS grug. when grug join new shiny rock mine, grug will use tool that other grugs use

Conclusion

good data better than complex pipeline