Analysis Workflow Template v1.3 (Data Science in Python (vs. R))

Geoffrey Gordon Ashbrook
Published in Wooden Information

Dec 12, 2019 · 5 min read

In the interest of moving towards a discussion of best practice workflow for analyzing data:

1. State Goals:

- Where are you? What are you looking at?

- Where do you want to be? What are you looking for?

- How will you get there (to where you want to be)?

- vs. “solving a user problem”: the goals above can often be translated into a user problem, but frequently are not framed that way.

2. Identify Standards and Best Practice that you will (aim to) adhere to:

- PEP 8 (the Python style guide)

- Unit Tests

- documentation of package and library versions (e.g. in a readme)

- repeatability, auditability (and maintainability?)

- clarity on your hypothesis, your null hypothesis, and your chosen significance benchmarks (95% vs. 99%)

https://www.investopedia.com/terms/n/null_hypothesis.asp

3. Pick hardware and software on which to run:

- “local” (your computer hardware) or remote (over the internet)

- Windows or POSIX (Unix, BSD, macOS, Linux)

- vim and a terminal

- Google Colab (a remote Debian-based Linux virtual environment)

- a local Jupyter notebook in Anaconda Python

- IDEs (integrated development environments)

- note: for Python data science, the line between a text editor and an IDE is blurry, in part because you will sometimes run code in a ‘Python terminal’ (ipython, a pipenv shell, etc.) and sometimes in a normal command-line terminal

- Spyder in Anaconda on your local computer

- Virtual environments, docker containers, and AWS-EC2

- etc

4. Import (main, initial) libraries:

- best practice: document the versions of all of your software packages and libraries (see the sketch after this list)

- os

- python

- conda

- pandas

- etc.

- maybe create a bundle of that software to be used later by other people

- maybe dockerize your environment so others can recreate it.
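For example, a minimal sketch of recording versions (assuming pandas and numpy are among the libraries you actually use; swap in your own imports):

import sys
import platform
import pandas as pd
import numpy as np

# Record the interpreter, the OS, and each library's version
# so others (and future-you) can recreate the environment.
print('Python: ', sys.version)
print('OS:     ', platform.platform())
print('pandas: ', pd.__version__)
print('numpy:  ', np.__version__)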

5. Get files / data sets / etc.

- obtaining data: source, scraping

- raw data: issues, file formats

- loaded data: dataframe, arrays, database
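For example, a minimal sketch of loading a raw CSV file into a pandas dataframe (the file name here is hypothetical):

import pandas as pd

# read_csv has many options (sep, encoding, parse_dates, etc.)
# for dealing with messy raw files
df = pd.read_csv('your_raw_data.csv')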

6. Initial Exploration 1: first observations, weed out issues

- did it load properly (or at all), header, etc.

- shape

- NaN (like null or empty values, but stored as float64)

- odd characters

- cardinality

- redundancy

- empty columns

- formatting of columns (may look like int but be string or vice versa)

- is it time-series?
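Most of these first checks take only a few lines of pandas (a minimal sketch, assuming your data is loaded as a dataframe df):

# Quick first-pass checks on a freshly loaded dataframe
print(df.shape)           # number of rows and columns
print(df.head())          # did it load properly, with the right header?
print(df.dtypes)          # columns that look like int but are string, etc.
print(df.isnull().sum())  # NaN / empty values per column
print(df.nunique())       # cardinality of each column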

7. Initial Exploration 2: basic patterns

- Use basic visualizations to analyze basic patterns in your data.

Tools for exploration:

- pandas-profiling
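A minimal sketch of looking at basic patterns (the pandas-profiling report is optional and assumes that package is installed):

import matplotlib.pyplot as plt

# Summary statistics and quick distribution plots for numeric columns
print(df.describe())
df.hist(figsize=(12, 8))
plt.show()

# Optional: an automated report of distributions, correlations, and missing data
# from pandas_profiling import ProfileReport
# ProfileReport(df).to_file('profile_report.html')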

8. Cleaning Data (90% of time will be spent cleaning data)

9. Formatting, “feature engineering,” etc.

- “Tidy” Format (the name of a real standard)

https://en.wikipedia.org/wiki/Tidy_data

https://cfss.uchicago.edu/notes/tidy-data/

- Explanation: What is “feature” engineering?

- Literally: making a new column of data (a new “feature”) based on existing data
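For example, a minimal feature-engineering sketch (assuming a dataframe df with hypothetical columns price, sqft, and sale_date):

import pandas as pd

# Make new columns (new 'features') from existing data
df['price_per_sqft'] = df['price'] / df['sqft']            # ratio of two columns
df['sale_year'] = pd.to_datetime(df['sale_date']).dt.year  # part of a date column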

10. Features and Focus

- What is y (what you will predict)

- What X features should you include (/ exclude)

- note: “time-series for splits”

This may not come up until later, but think about what you are looking for and how to take things like the risk of ‘time leakage’ into account (see here)

https://medium.com/@GeoffreyGordonAshbrook/less-is-more-2070da966db7

- How to turn data such as natural language into X and y data:

Explanation of a Natural Language Model

https://colab.research.google.com/drive/1n0QHVKLmjHhb1J0PVumoxq58-1OevP5b

11. Create and set up environments and virtual environments, and begin organizing libraries, packages, and dependencies:

- virtual environments / containers:

- pipenv shell

- conda env

- venv

- conda kernel

- pip vs. conda package installs

- best practice: record exact versions of all packages used. (e.g. requirements.txt)

12. Articulate your Hypothesis. What is your Null Hypothesis?

13.1. Establish A Baseline: Results to compare and score your model’s performance

(car race analogy)

(However narrowly your model wins over the baseline, whether it beats the baseline at all is the binary measure of whether you have succeeded in making a ‘better’ or ‘more useful’ model.)

Rules of Thumb for Setting a Baseline:

- For ordinal or continuous ‘y’ data:

e.g. mean, median, mode, etc.

- For categorical ‘y’ data:

The ‘majority class’ is your baseline (a ‘class’ is a category; the majority class is the most common/frequent one).

Python code to see majority class: >>> y_train.value_counts()[:1]

- (some say) Start with a kitchen sink “baseline model”:

a “kitchen sink” model as baseline is usually a basic model like a regression with ‘all feature columns’ included.
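As a minimal sketch of the first two rules of thumb, using scikit-learn's dummy estimators (assumes X_train, y_train, X_val, and y_val already exist):

from sklearn.dummy import DummyClassifier, DummyRegressor

# Categorical y: predict the majority class for every row
baseline_clf = DummyClassifier(strategy='most_frequent')
baseline_clf.fit(X_train, y_train)
print('majority-class accuracy:', baseline_clf.score(X_val, y_val))

# Continuous y: predict the mean for every row
baseline_reg = DummyRegressor(strategy='mean')
baseline_reg.fit(X_train, y_train)
print('mean-baseline R^2:', baseline_reg.score(X_val, y_val))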

13.2 Get baseline “score”: (is this right?)

- pick score type:

- e.g. from confusion matrix:

- accuracy (for the majority-class baseline, accuracy is simply the majority class's frequency), etc.

- precision

- recall

- F1 (great self explanatory name, huh?)

- MSE

etc.
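A minimal sketch of computing several of these scores with scikit-learn (assumes true values y_val and predictions y_pred already exist):

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Classification scores derived from the confusion matrix
print(confusion_matrix(y_val, y_pred))
print('accuracy: ', accuracy_score(y_val, y_pred))
print('precision:', precision_score(y_val, y_pred))
print('recall:   ', recall_score(y_val, y_pred))
print('F1:       ', f1_score(y_val, y_pred))

# For regression, use mean squared error instead:
# from sklearn.metrics import mean_squared_error
# print('MSE:', mean_squared_error(y_val, y_pred))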

14. Split your Dataset: Make your sets: Train, Val, & Test sets

- issue: random or time based split

- issue: split ratio

(for cross-validation and hyperparameter search: RandomizedSearchCV, GridSearchCV (not preferred), and also Bayesian search? (which library?))

Explanation: You are running an experiment in which you need to compare results. You not only need to compare your model to a baseline; you also need to compare how the model performs on ‘real data’ it has not seen versus the data you used to train it.
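A minimal sketch of a random three-way split with scikit-learn (roughly 70/15/15 here, which is just one common ratio; a time-series problem would split by date instead of randomly):

from sklearn.model_selection import train_test_split

# Carve off the test set first, then split the remainder into train and val
train, test = train_test_split(df, test_size=0.15, random_state=42)
train, val = train_test_split(train, test_size=0.18, random_state=42)  # ~15% of the original
print(len(train), len(val), len(test))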

15. “Wrangle” Using a Function:

(for all: train,val,test)

Create a function to automate the process of cleaning, handling, and reshaping your data, so that all divisions of the split data (and new future data?) can be handled in the same way (see the sketch after this list). This systematizes what you previously did through exploration.

Wrangle items:

- duplicates

- outliers

- drop empty data

- some “encoding”

- the formats of dates

- splitting times and dates into separate columns

- feature engineering
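A minimal wrangle-function sketch (the specific columns and steps are hypothetical; adapt them to your own data):

import pandas as pd

def wrangle(df):
    """Clean and reshape a dataframe the same way for train, val, and test."""
    df = df.copy()

    # Duplicates and empty data
    df = df.drop_duplicates()
    df = df.dropna(axis='columns', how='all')  # drop completely empty columns

    # Dates: standardize the format and split out useful parts
    df['sale_date'] = pd.to_datetime(df['sale_date'])
    df['sale_year'] = df['sale_date'].dt.year
    df['sale_month'] = df['sale_date'].dt.month

    # Simple feature engineering
    df['price_per_sqft'] = df['price'] / df['sqft']

    return df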

16. (Make) Your Family of Variables

Establishing a ‘family’ of standardized variables will help regularize the process of making pipelines and running models.

e.g.

# Set Target & Features
target = 'feature_column_that_is_y_value'
features = ['column_you_use_1', 'column_you_use_2', 'etc']

# Wrangle train, validate, and test sets in the same way
train = wrangle(train)
val = wrangle(val)
test = wrangle(test)

# Arrange data into X features matrix and y target vector
X_train = train.drop(columns=target)
y_train = train[target]
X_val = val.drop(columns=target)
y_val = val[target]

17. Pick what models you will try running. Try and compare multiple types of models if possible. Try to ‘argue’ why you have picked and not picked types of models based on your situation.

https://www.datasciencecentral.com/profiles/blogs/40-techniques-used-by-data-scientists

18. Pre-Pipeline phase:

Some say: before building a pipeline, run a model without the pipeline, with each step done separately, so that it is easier to debug.

19. Pipeline: make a pipeline (e.g. sklearn) so that you can easily switch in and out different models and change parameters and hyperparameters and processes.

- If your data is in the form of a family of variables, you can easily run processes in a few short lines.

# Feature Scaling
from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
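The snippet above scales features step by step; a minimal sketch of an actual scikit-learn pipeline, which bundles the scaler and the model so you can swap pieces in and out, might look like this (assumes X_train, y_train, X_val, and y_val exist):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# One object that scales the features and then fits the model
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=42)
)
pipeline.fit(X_train, y_train)
print(pipeline.score(X_val, y_val))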

20. Run Your Model(s): Proverbially “.fit .predict”

Once your data is organized, creating, running, and evaluating your model may be unexpectedly brief:

e.g.

from sklearn.linear_model import LogisticRegressionCV

model = LogisticRegressionCV(cv=5, n_jobs=-1, random_state=42)
model.fit(X_train_scaled, y_train)
model.score(X_val_scaled, y_val)

21. Evaluate, Compare, Refine, Redeploy Models, adjust hyperparameters, squeeze out large and small improvements.

Tools to evaluate a model:

- confusion_matrix

- PDP (partial dependence plots)

(Note: the term ‘deploy’ may be used here differently than in web product deployment.)
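A minimal PDP sketch (assumes a fitted model and validation features X_val; the exact plotting helper depends on your scikit-learn version, and newer versions use PartialDependenceDisplay instead):

import matplotlib.pyplot as plt
from sklearn.inspection import plot_partial_dependence

# How does the predicted outcome change as one feature varies,
# averaging over the other features?
plot_partial_dependence(model, X_val, features=[0, 1])
plt.show()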

22. Formally analyze the roles and relationships between the features columns.

- Are there interactions between columns?

- What columns have more of an effect?

- As in the case of time leakage, will some columns be disruptive and need to be discarded?

Tools:

- SHAP (Shapley values)

https://towardsdatascience.com/how-to-explain-any-machine-learning-model-prediction-30654b0c1c8
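A minimal SHAP sketch (assumes the shap package is installed and a fitted tree-based model such as a random forest; linear models need a different shap explainer):

import shap

# Shapley values estimate how much each feature pushed each
# individual prediction up or down relative to the average
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_val)
shap.summary_plot(shap_values, X_val)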

23. Pick a final model or combination of models to use. (e.g. conventional models are often combined with neural networks)

24. Export the final “Test” or other final real-data results (using whole dataset to train)

When your results are the best you can achieve, fully train (and pickle) your model, and apply it to your final test or real world data.
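A minimal sketch of pickling the fully trained model so it can be reloaded and applied later (the file name is hypothetical):

import pickle

# Save the fully trained model to disk
with open('final_model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Later: reload it and predict on the final test / real-world data
with open('final_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
final_predictions = loaded_model.predict(X_test)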

25. Examine the proverbial “accuracy” or score of your model and “publish” your results and commentary.

- Articulate your conclusions in terms of your null hypothesis and the significance of your results.

26. Review and complete documentation of and for your code.

Another person’s account (with some overlap):

https://towardsdatascience.com/a-data-science-workflow-26c3f05a010e
