Notes from PyConSK 2017
Notes, thoughts and conclusions from talks and workshops at PyConSK, 11-12 March 2017:
What makes Silicon Valley software developers special, Pavel Serbajlo (Exponea CTO):
SV is ~6% of the US economy.
Presents information about SV and the core attitudes and mindset of business there:
- Sharing information about challenges, even between direct competitors.
- Openness and diversity are important.
Conclusion: culture is special.
Making monitoring boring, Michal Hanula (G Eng):
Great talk, because it came with lots of kittens.
Building Data Pipelines with Python, Katharine Jarmul (Cons):
Python is now a first-class citizen in pipeline frameworks. Most frameworks work with DAGs (directed acyclic graphs).
Simple to complex:
- Dask: with parallel execution
- Luigi & Airflow
- Stream processing frameworks
Airflow: built-in scheduler, state and status DB, tasks are functions, DAGs via a simple wrapper. May become an Apache project fairly soon. DAGs are visualized for documentation. Tasks run automatically via Celery.
Luigi: Hadoop integration. Start tasks manually, via cron or by itself. Task visualizer built in, but less documentation than Airflow.
Real-time stream processing: PySpark, Python-Flink, Google Dataflow, Data Pipeline (AWS, less functionality). Examples: word count and tweets with PySpark streaming in Jupyter.
What to automate: preprocessing, repetitive tasks and reports, subtasks or subprocesses, basic analytics/dashboards, basic data validation.
Document what can’t be automated. Check if pieces can be automated.
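The DAG idea from this talk can be sketched without any framework. This is only an illustration with the standard library's graphlib (Python 3.9+), not the Airflow or Luigi API; the task names, the toy data and run_pipeline are made up for the example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Hypothetical tasks: each one is a plain function that reads the
# results of earlier tasks from a shared dict.
def extract(results):
    return [" 3", "1 ", "2"]


def clean(results):
    return [int(value.strip()) for value in results["extract"]]


def total(results):
    return sum(results["clean"])


def report(results):
    return f"{len(results['clean'])} rows, total {results['total']}"


TASKS = {"extract": extract, "clean": clean, "total": total, "report": report}

# The DAG: each task maps to the set of tasks it depends on.
DEPENDENCIES = {
    "extract": set(),
    "clean": {"extract"},
    "total": {"clean"},
    "report": {"clean", "total"},
}


def run_pipeline(tasks, dependencies):
    """Run every task once, after all of its dependencies have run."""
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        results[name] = tasks[name](results)
    return results


if __name__ == "__main__":
    print(run_pipeline(TASKS, DEPENDENCIES)["report"])
```

Real frameworks add what this sketch leaves out: scheduling, retries, persisted state and parallel workers.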
Workshop: Machine learning with text in Scikit-learn, Peter Zvirinsky (Uni Prag):
Notebook in Python/ML/pycon_worskhop_2017. Text preprocessing and simple ML on corpora. Snippet:
# The dataset contains Czech texts, so strip the accents (diacritics) from the text first.
import unicodedata
nfkd_form = unicodedata.normalize('NFKD', s)  # s is the input string
ascii_string = nfkd_form.encode('ASCII', 'ignore').decode('ASCII')  # decode back to str
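A small sketch of how the snippet above might be wrapped for reuse. The function name strip_accents and the sample strings are mine, not from the workshop notebook.

```python
import unicodedata


def strip_accents(text):
    """Decompose accented characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Everything that is not plain ASCII is dropped here, which includes
    # letters that have no decomposition at all (see the note below).
    return decomposed.encode("ASCII", "ignore").decode("ASCII")


if __name__ == "__main__":
    print(strip_accents("Příliš žluťoučký kůň"))  # Prilis zlutoucky kun
```

One caveat: characters without a Unicode decomposition are removed entirely instead of being mapped to a base letter. For example the Polish "ł" disappears, so this approach suits Czech text better than arbitrary input.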
Django Channels and RT Data, Rachel Willmer (Dev):
Channels is a stand-alone project, for products at scale. WebSockets built in (Daphne). Redis and local-memory backends. Doesn’t provide persistent WebSockets.
Workshop: Data Wrangling in Python, Katharine Jarmul (Cons):
Notebook in Python/PyAna/data-wrangling-pycon. Standard wrangling techniques with pandas.
Scrum sucks!, Martin Strycek (Team MGMT):
The whole process doesn’t work; better to pick some parts of it and align them with team and culture.
How to build up Python community and empower women, Kristi Pogri & Jona Azizaj (OpenLab, Albania):
Physical space is essential. Regional cross-border collaboration is good. Balance between paid and volunteer staff. No gender balance, no community.
Most important: research cultural specifics.
Problems for women: invisibility, exceptionalism, gender essentialism and social expectations, sexualized environment.
Initiatives: WoMoz, Fedora Women, diversity team of Fedora. In the hackerspace: Ada Lovelace Day. Organizers of major conferences are mainly women.
How to encourage women:
- Recruit diversity
- Create Code of Conduct
- Value all contributions
- Organize events/confs
Human part of open-source, Adrian Holovaty (co-creator of Django):
Behind everything there are just normal people.
Red Hat — how to build diverse and entertaining open-source corporate culture, Jana Gutierrez Chvalkovska (Red Hat, HR):
Diversity important when you have diverse clients, for innovation, talent development and future progress.
Diverse employees have diverse needs.
How far can you go with python, Jakub Balas (Cons, Dev):
Far 😉 Especially in data science. Job ads for junior devs are as funny as those for data scientists…
Object Calisthenics — 9 steps to better code, Pawel Lewtak (Dev):
DRY, KISS, SOLID, YAGNI, Zen of Python.
Object calisthenics are programming exercises, OOP related. 9 rules:
- Only 1 level of indentation per method
- Don’t use the ELSE keyword
- Wrap all primitives and strings
- First class collections
- One dot per line
- Don’t abbreviate
- Keep all entities small
- No classes with more than two instance vars
- No Getters/Setters/Properties
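To make a few of these rules concrete, here is a minimal sketch of my own (not from the talk) covering three of them: one level of indentation, no ELSE, and wrapping primitives. The Email class, shipping_cost and the price tiers are invented for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Email:
    """Wraps a primitive string so the validation lives in one place."""

    value: str

    def __post_init__(self):
        if "@" not in self.value:
            raise ValueError(f"not an email address: {self.value!r}")


def shipping_cost(weight_kg):
    """Guard clauses replace else branches; one level of indentation."""
    if weight_kg <= 0:
        raise ValueError("weight must be positive")
    if weight_kg < 1:
        return 2.5
    if weight_kg < 10:
        return 5.0
    return 12.0


if __name__ == "__main__":
    print(Email("jana@example.sk").value)
    print(shipping_cost(4))
```

The early returns keep every branch at the same depth, and code that receives an Email no longer has to re-check a raw string.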
Let’s solve cardiac diseases together: Python meets medical imaging and machine learning, Jan Margeta (Founder KardioMe):
With AI: image recognition on CT/MR scans of the heart with deep learning. The problem: getting from CV arrays to patterns (like human perception). CV modelling with Keras, which takes a lot of time. ML with sklearn. Simple example: Flask to serve predictions, maybe in a container.
Used Python for prototyping and model development. Loading the model from a C runtime (reworking in C++). Runs on GPUs.
- Iterating fast most important
- Repeatable data pipelines necessary
- No glory in data preparation
- Augment small datasets
- Annotate data w/ missing labels asap
For rare diseases there will be recognition problems where you need human help, because of the lack of data. For common problems it will be automatic.
Discovering Related Content with Python, Yian Shang (Data Eng, Vox):
Taking information from articles and creating new content with it. Proof of concept.
Tried TF-IDF, settled on word2vec using gensim. Extended word2vec to article2vec: take the mean of each feature in the article and sum. Cosine similarity for ranking. Backend is Redis, API built with Flask. UI is a Slack bot.
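A rough sketch of the averaging and ranking idea, under my own assumptions: I read article2vec as the mean of an article's word vectors, and the 3-dimensional vectors below are toy values invented for the example (real word2vec vectors would come from a trained gensim model). The function names are hypothetical too.

```python
import math

# Toy word vectors, invented for this sketch.
WORD_VECTORS = {
    "election": [1.0, 0.0, 0.0],
    "senate": [0.9, 0.1, 0.0],
    "goal": [0.0, 1.0, 0.0],
    "match": [0.1, 0.9, 0.0],
    "recipe": [0.0, 0.0, 1.0],
}


def article_vector(tokens, word_vectors):
    """Average the vectors of all known words; None if no word is known."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        return None
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_related(query_tokens, articles, word_vectors):
    """Return article titles ordered from most to least similar."""
    query = article_vector(query_tokens, word_vectors)
    if query is None:
        return []
    scored = []
    for title, tokens in articles.items():
        vector = article_vector(tokens, word_vectors)
        if vector is not None:
            scored.append((cosine_similarity(query, vector), title))
    return [title for _, title in sorted(scored, reverse=True)]


if __name__ == "__main__":
    articles = {
        "politics": ["election", "senate"],
        "sports": ["goal", "match"],
        "food": ["recipe"],
    }
    print(rank_related(["senate", "election", "match"], articles, WORD_VECTORS))
```

With the toy data the query ranks politics first, then sports, then food.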
Test: recommend similar articles, evaluate click-through rate and feedback.
A documentation crash course for developers, Chris Ward (Tech Writer):
Docs are often the first contact with a project. A lack of docs is mostly not the problem, quality is: less is more; short doesn’t necessarily mean simple. 3 questions before writing docs:
- Who are you writing for?
- What are they trying to achieve?
- Why are you writing this?
- Assume nothing
- Refine your concept(s) — communicate on point
- API docs are (not) always enough. Survey: Java devs only need API docs, JS devs need quickstarts, Python somewhere in the middle?
- It’s not a manual
- Interactivity, not just text: adding it helps a lot
Studies: users don’t read docs, they mostly scan the content.
Consistency is most important: phrases, terms.
Jupyter: if you don’t use it yet you’re doing wrong, Christian Barra (Stat):
Jupyter dashboards are a bit similar to Shiny. The next version is probably the last with Python 2 support.
Great for fast prototyping. Python is the future for data science. Token-based security since the last version. Debugging is a pain point.
JupyterLab will become the new frontend. v1 should come out by the end of the year.
Topic Modelling with Gensim, Bhargav Srinivasa Desikan (Stud researcher):
Gensim for NLP — fast & easy.
Organize data and documents by topics instead of keywords. Topics are collections of words that occur together. Algorithm: LDA.
Simple evaluation: color all words by topic. Better: use a coherence model instead.