argparse is a Python library that lets us build our own command-line interfaces to add flexibility to our code. I personally use it in many of my scripts to make my data pipelines more flexible and, for example, to train models on a moving time window. We’ll look at some use cases after a quick tour of the library.
First, we need to import the library.
import argparse
Then we define a “Parser”:
parser = argparse.ArgumentParser()
The ArgumentParser object will hold all the information necessary to parse the command line into Python data types. …
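To make this concrete, here is a minimal sketch of how a script can expose a moving time window on the command line; the --start_date and --window arguments are hypothetical, not taken from the article:

import argparse

parser = argparse.ArgumentParser(description="Train a model on a moving time window")
# hypothetical arguments, for illustration only
parser.add_argument("--start_date", type=str, required=True,
                    help="first day of the window (YYYY-MM-DD)")
parser.add_argument("--window", type=int, default=30,
                    help="window length in days")
args = parser.parse_args()

print(args.start_date, args.window)

Called as python train.py --start_date 2020-01-01 --window 60, the script receives the values already parsed and type-checked.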
Before doing anything with pandas_udf, the prerequisite is to install a compatible version of PyArrow:
sudo pip3 install pyarrow==0.14.1
Then we need to set the ARROW_PRE_0_15_IPC_FORMAT environment variable to 1 (see the import code below).
Then we can proceed to importing the libraries.
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import pandas_udf, PandasUDFType, sum, max, col, concat, lit
import sys
import os
# setup to work around with pandas udf
# see answers here https://stackoverflow.com/questions/58458415/pandas-scalar-udf-failing-illegalargumentexception
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

from fbprophet import Prophet
import pandas as pd
import numpy as np
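With these imports in place, the typical pattern is a grouped-map pandas UDF that fits one Prophet model per group. Below is a minimal sketch reusing the imports above; the input dataframe df and its columns STORE, DATE and QTY are hypothetical, and the 30-day horizon is arbitrary:

# schema of the dataframe returned by the UDF
result_schema = StructType([
    StructField("STORE", LongType()),
    StructField("ds", TimestampType()),
    StructField("yhat", DoubleType())])

@pandas_udf(result_schema, PandasUDFType.GROUPED_MAP)
def forecast_store(pdf):
    # one pandas DataFrame per store: fit Prophet and forecast 30 days ahead
    model = Prophet()
    model.fit(pdf.rename(columns={"DATE": "ds", "QTY": "y"}))
    future = model.make_future_dataframe(periods=30)
    forecast = model.predict(future)[["ds", "yhat"]]
    forecast["STORE"] = int(pdf["STORE"].iloc[0])
    return forecast[["STORE", "ds", "yhat"]]

predictions = df.groupBy("STORE").apply(forecast_store)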
If…
Collaborative filtering, KNN, deep learning, transfer learning, TF-IDF, etc.: we will explore all of these.
In this article we will review several recommendation algorithms, evaluate them through KPIs and compare them in real time. We will see, in order:
N.B.: I was greatly inspired by Gabriel Moreira’s great notebook (https://www.kaggle.com/gspmoreira/recommender-systems-in-python-101), thanks to him; some models and functions, like the user profiler and the evaluation function, come from his notebook.
I will introduce the databases and define what will…
In this article, we will build a step-by-step demand forecasting project with PySpark. Here is the list of tasks:
First we will import our data with a predefined schema. I work on a virtual machine on Google Cloud Platform, and the data comes from a bucket on Cloud Storage. Let’s import it.
from pyspark.sql.types import *

schema = StructType([
    StructField("DATE", DateType()),
    StructField("STORE", IntegerType()),
    StructField("NUMBERS_OF_TICKETS", IntegerType()),
    StructField("QTY", IntegerType()),
    StructField("CA", DoubleType()),
    StructField("FORMAT", StringType())])

df = spark.read.csv("gs://my_bucket/my_table_in_csv_format"…
I have some manga data; I even wrote an article so that you can collect this dataset (with some modifications), see: https://towardsdatascience.com/scrape-multiple-pages-with-scrapy-ea8edfa4318
With this article, I will start a series of short tutorials on PySpark, from data pre-processing to modeling. The first one will deal with the import and export of any type of data: CSV, text file, Avro, JSON, etc. I work on a virtual machine on Google Cloud Platform, and the data comes from a bucket on Cloud Storage. Let’s import it.
Spark has a built-in function to read CSV; it is as simple as:
csv_2_df = spark.read.csv("gs://my_buckets/poland_ks")

# print it
csv_2_df.show()
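In practice a few options are often useful as well. Here is a minimal sketch of the ones I reach for most, header, schema inference and the separator; the values shown are assumptions about the file, not facts from the article:

csv_2_df = (spark.read
            .option("header", "true")       # first line contains the column names
            .option("inferSchema", "true")  # let Spark guess the column types
            .option("sep", ",")             # field separator
            .csv("gs://my_buckets/poland_ks"))

csv_2_df.printSchema()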
In this tutorial we will develop a script to collect, in a structured way, unstructured data from myanimelist. To do so, we will scrape several pages and sub-pages to build a complete dataset.
Scrapy is an open-source and collaborative framework for extracting data from the web.
There are several kinds of libraries and frameworks that let us do web scraping, notably Scrapy, Selenium and BeautifulSoup, to name only the best known.
Scrapy is a tool built specifically to make requests, scrape and save data on the web; it is self-sufficient…
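To give a taste of what that looks like, here is a minimal spider sketch; the listing URL and the CSS selectors are hypothetical, not the tutorial’s actual code:

import scrapy

class MangaSpider(scrapy.Spider):
    name = "manga"
    # hypothetical listing page on myanimelist
    start_urls = ["https://myanimelist.net/topmanga.php"]

    def parse(self, response):
        # hypothetical selectors: one item per row of the ranking table
        for row in response.css("tr.ranking-list"):
            yield {"title": row.css("h3 a::text").get()}

Saved as manga_spider.py, it can be run with scrapy runspider manga_spider.py -o manga.json.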
In this article, we will develop a recommendation system based on unstructured data, namely images. In order to have a fast, operational model without the laborious work of fine-tuning a convolutional neural network, we will use transfer learning.
Transfer learning is a research problem in machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem.[1] For example, knowledge gained while learning to recognize cars could apply when trying to recognize trucks.
To make it even easier, we will use the high-level API Keras. Keras offers several pre-training…
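As an illustration of the idea, here is a minimal sketch of using a pretrained network as an image feature extractor; VGG16 is just one possible choice, and the article may use a different model:

import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.preprocessing import image

# pretrained convolutional base, classification head removed;
# global average pooling yields one fixed-length vector per image
model = VGG16(weights="imagenet", include_top=False, pooling="avg")

def embed(img_path):
    img = image.load_img(img_path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return model.predict(x)[0]

Similar images can then be recommended by comparing these vectors, for example with cosine similarity.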
In order to have cleaner and more industrializable code, it may be useful to create a pipeline object that handles feature engineering. Suppose we have this type of dataframe:
df.show()

+----------+-----+
| date|sales|
+----------+-----+
|2018-12-22| 17|
|2017-01-08| 22|
|2015-08-25| 48|
|2015-03-12| 150|
+----------+-----+
Then we want to create variables derived from the date. Most of the time, we’ll do something like this:
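A minimal sketch of that usual approach (not the article’s exact code), deriving a few calendar features with withColumn:

from pyspark.sql import functions as F

df_features = (df
               .withColumn("year", F.year("date"))
               .withColumn("month", F.month("date"))
               .withColumn("day_of_week", F.dayofweek("date"))
               .withColumn("week_of_year", F.weekofyear("date")))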
Now, we want to integrate the creation of these variables into a Spark pipeline and, in addition, put some safeguards in place before they are computed. …
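One way to do that, sketched here under the assumption that a custom pyspark.ml Transformer is acceptable (the article’s own implementation may differ), is to wrap the logic in a stage that can be chained inside a Pipeline:

from pyspark.ml import Pipeline, Transformer
from pyspark.sql import functions as F

class DateFeaturizer(Transformer):
    """Hypothetical stage: derives calendar features from the `date` column
    and fails fast if that column is missing (the safeguard)."""

    def _transform(self, df):
        assert "date" in df.columns, "expected a `date` column"
        return (df
                .withColumn("year", F.year("date"))
                .withColumn("month", F.month("date"))
                .withColumn("day_of_week", F.dayofweek("date")))

pipeline = Pipeline(stages=[DateFeaturizer()])
df_features = pipeline.fit(df).transform(df)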
Data scientist at Auchan Retail Data