#18 Paris Women in Machine Learning & Data Science: Knowledge Graph, Language Detection & Twitter
Our 18th meetup at Heetch will remain unforgettable. After a snowy ⛄ meetup in January, we selected the hottest day of the year 🌞 to organize our 4th “Hors-série” meetup!
💟 Several members of the WiMLDS Paris group will attend a summer camp organized by ESPCI to give young female students a taste of what a STEM career looks like (from June 29 to August 2, 2019).
📺 The talk has been recorded and is available on our YouTube channel 📺
1️⃣ All the way from California, Sanghamitra Deb kick-started the evening with a talk about how to build a knowledge graph using weak supervision. After a Ph.D. in astrophysics, Sangha switched to Machine Learning and is now a Data Scientist at Chegg, an edTech company.
Sangha presented a whole pipeline on how to build a knowledge graph. The top tips were:
- Use weak supervision. Labeled data is very expensive to get. As an alternative, use other noisy sources to generate your dataset. In NLP, a useful technique is to write rules and heuristics. Sangha highlighted the Python package Snorkel, which can be used for weak supervision.
- Use active learning to improve the accuracy of your model. For instance, if the classification probability of an example is below some threshold, send it to be labeled.
- Play with the threshold. If you classify an example as positive only when its probability exceeds some threshold, that threshold becomes a knob you can tune. A high threshold improves precision (the share of predicted positives that are true positives) at the cost of coverage.
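To make these tips concrete, here is a minimal sketch in plain Python. It does not use Snorkel's API: a simple majority vote stands in for Snorkel's more refined way of combining noisy labels. The relation being labeled ("is a prerequisite of"), the labeling functions, the thresholds, and the sample scores are all hypothetical and were not shown in the talk.

```python
from collections import Counter

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Hypothetical labeling functions: each one encodes a single noisy heuristic
# and abstains when the heuristic does not apply.
def lf_mentions_prerequisite(text):
    return POSITIVE if "prerequisite" in text.lower() else ABSTAIN

def lf_contains_requires(text):
    return POSITIVE if "requires" in text.lower() else ABSTAIN

def lf_says_unrelated(text):
    return NEGATIVE if "unrelated" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_mentions_prerequisite, lf_contains_requires, lf_says_unrelated]

def weak_label(text):
    """Combine the noisy votes by majority; abstain on ties or when nothing fires."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    counts = Counter(v for v in votes if v != ABSTAIN).most_common()
    if not counts:
        return ABSTAIN
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]

def needs_human_label(probability, threshold=0.6):
    """Active learning rule from the talk: low-probability examples go to annotators."""
    return probability < threshold

def precision_and_coverage(probabilities, true_labels, threshold):
    """Precision among examples predicted positive, and the share of examples kept."""
    kept = [y for p, y in zip(probabilities, true_labels) if p >= threshold]
    if not kept:
        return 0.0, 0.0
    return sum(kept) / len(kept), len(kept) / len(probabilities)

# Made-up model scores and ground truth, only to show the trade-off.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4]
truth = [1, 1, 1, 0, 1, 0]
for t in (0.5, 0.85):
    precision, coverage = precision_and_coverage(scores, truth, t)
    print(f"threshold={t}: precision={precision:.2f}, coverage={coverage:.2f}")
```

On this toy data, raising the threshold from 0.5 to 0.85 lifts precision from 0.80 to 1.00 while coverage drops from 0.83 to 0.33, which is the trade-off described in the last tip.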
2️⃣ We continued the evening with Cécile Chailloux, who presented another NLP task: High-performance language detection of web pages. Cécile is a semantic engineer at Dashlane, a company that secures and manages your passwords.
The task consists of detecting the language of “tiny” webpages (only tens of words) in a fast and secure way. For privacy and performance reasons, the solution needs to avoid external dependencies (no APIs).
Their solution for recognizing language is based on n-grams. Cécile stressed that “Ad hoc solutions are worth it. Don’t be too ambitious and be good at your specific task.” To illustrate this point she gave specific examples:
- Use your own dataset (and not Wikipedia) to reduce your scope. It improves your model on this scope, and it is faster to train.
- Many pages are multilingual, which makes detecting the language much harder. After a quick analysis, they could see that most multilingual cases are the original language plus English. The chosen solution is to label these pages as English. A pragmatic solution that worked.
Finally, Cécile talked about new challenges, like recognizing languages written in non-alphabetic scripts, which will require character recognition.
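For readers unfamiliar with n-gram language detection, here is a minimal sketch of the general idea, using only the Python standard library (in keeping with the "no dependencies" constraint). It is not Dashlane's implementation: the character trigrams, the cosine similarity scoring, and the tiny login-page style training snippets are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Lowercase, normalize whitespace, pad with spaces, and slice into n-grams."""
    padded = f" {' '.join(text.lower().split())} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def build_profile(samples, n=3):
    """Count n-gram frequencies over all training samples of one language."""
    profile = Counter()
    for sample in samples:
        profile.update(char_ngrams(sample, n))
    return profile

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical in-domain training text (short login-page phrases),
# echoing the advice to train on your own data rather than Wikipedia.
TRAINING = {
    "en": ["log in to your account", "enter your password and email address",
           "create a new account", "forgot your password"],
    "fr": ["connectez-vous à votre compte", "entrez votre mot de passe et votre adresse e-mail",
           "créer un nouveau compte", "mot de passe oublié"],
}
PROFILES = {lang: build_profile(samples) for lang, samples in TRAINING.items()}

def detect_language(text):
    """Return the language whose profile is most similar to the text."""
    query = Counter(char_ngrams(text))
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))

print(detect_language("reset your password"))  # en
print(detect_language("votre mot de passe"))   # fr
```

Because the profiles are plain counters built once, detection on a page of a few dozen words is a handful of dictionary lookups, which is why this family of methods suits small, latency-sensitive inputs.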
3️⃣ The evening concluded with a panel discussion moderated by our own Natalie Cernecka about Twitter and how to use it. Mathilde Kurzawa, Morgane Dalbergue, and our very own Caroline Chavier shared dos and don’ts with the audience.
- Don’t get lost. Remember you can use lists to separate subjects. You can track subjects that are important to you, like AI conferences.
- Brand yourself. Authenticity is important. Also, don’t forget to highlight other people’s expertise. If you want to grow your follower base, visuals like emojis and gifs help 😊.
- Don’t be afraid to interact. People you are following in your field are usually nice and willing to interact if you message them. Morgane shared more tips in French here.
- About negativity: block and report. Don’t hesitate to support the person being abused.
To conclude, WiMLDS Paris wishes you an amazing summer 💛
If you want to stay posted about our activities, you are welcome to
📑check our Google spreadsheet if you want to speak 📣, host 💙, help 🌠
📍join our Slack channel for more discussions about machine learning, data science, and diversity in tech!
📩send an email to the Paris WiMLDS team to keep in touch: email@example.com
🔥 Feel free to share your company or lab’s job positions for free on WiMLDS’ website.
A special thanks to Mathilde Kurzawa and the Heetch team for their warm welcome, Betty Moreschini for live-tweeting and Morgane Dalbergue for replacing Marie Langé on the fly for the panel discussion!