<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Bilge Demirkaya on Medium]]></title>
        <description><![CDATA[Stories by Bilge Demirkaya on Medium]]></description>
        <link>https://medium.com/@bilgedemirkaya?source=rss-4b73bb1d006d------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*pYorclaAUJjSlJmUyuqDPw.jpeg</url>
            <title>Stories by Bilge Demirkaya on Medium</title>
            <link>https://medium.com/@bilgedemirkaya?source=rss-4b73bb1d006d------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Mon, 11 May 2026 16:54:21 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@bilgedemirkaya/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Sovereign AI: How Nations Are Claiming Their Tech Independence in 2024]]></title>
            <link>https://bilgedemirkaya.medium.com/sovereign-ai-how-nations-are-claiming-their-tech-independence-in-2024-d482099c40e9?source=rss-4b73bb1d006d------2</link>
            <guid isPermaLink="false">https://medium.com/p/d482099c40e9</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[politics]]></category>
            <category><![CDATA[power]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[Bilge Demirkaya]]></dc:creator>
            <pubDate>Sat, 21 Dec 2024 19:08:07 GMT</pubDate>
            <atom:updated>2024-12-21T19:08:32.294Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OnWxA3jpPIeLEztYyTG7ag.jpeg" /></figure><p>In 2024, the world saw a growing awareness of AI sovereignty across countries. With the global advancement of generative AI, maintaining technological independence has become more challenging because these systems are highly integrated and largely homogeneous. The models that research companies build reflect biases rooted in their creators’ nationalities and political values, which raised concerns that governments were relying on AI technologies that didn’t align with their own values and needs. Consequently, the concept of sovereign AI, which refers to “AI systems designed, built, and controlled by a nation-state or government,” became a popular strategy. AI models reflect their creators’ values and biases through the data used to train them. Therefore, countries are developing their own home-grown AI solutions and keeping their data centers within their borders. By 2024, having a robust sovereign AI capability had already become a crucial aspect of global competitiveness.</p><p>Countries sought to ensure their AI systems aligned with their cultural, societal, and legal values. Many nations are building models to reflect cultural and historical values and address unique national challenges. For instance, Taiwan is constructing the Trustworthy AI Dialogue Engine (Taide) to counter politically biased information from Chinese AI tools. When comparing different AI regulations, we see control and ambition increase as we move east. While the United Kingdom’s AI strategy aligns more with building the most ethical AI, China’s mission is to be the world leader in AI by 2030 (Acevedo, 2024).</p><p>This urge to create self-supporting and independent AI models raised many ethical questions, and the emphasis on transparency and fairness in AI models grew alongside AI regulations. A compelling example of sovereign AI is China’s generative AI regulations, implemented in August 2023. These regulations require AI-generated content to align with socialist core values and impose strict data and user-identification controls. This essay discusses the ethical implications of these regulations.</p><h3>China’s AI Regulations: Power or Paranoia?</h3><p>China’s “Measures for the Management of Generative Artificial Intelligence Services” is a set of regulations introduced in 2023 by the Cyberspace Administration of China (CAC). Generative AI refers to computational techniques capable of generating seemingly new, meaningful content such as text, images, or audio from training data.</p><p>The regulations aim to:</p><ol><li>Ensure that AI-generated content aligns with the Chinese government’s policies and socialist core values.</li><li>Prevent the spread of biased or false information created by other nations.</li><li>Mandate that training data used for AI models do not violate citizens’ privacy.</li><li>Set procedures or standards for the ethical development of AI, ensuring it contributes positively to society and is developed responsibly.</li></ol><p>“After spending several years exploring, debating, and enacting regulations that address specific AI applications, Chinaʼs policymaking community is now gearing up to draft a comprehensive national AI law” (Carnegie Endowment for International Peace, 2023). 
China’s desired AI system is achieved by using lawful and proper training data, and by promoting ethical, safe, and controllable AI development through precise procedures and standards.</p><h3>The Ethics Debate: Fairness, Misinformation, and Global Power</h3><h4>Bias in the Code</h4><p>The regulations from CAC require AI-generated content to align with “Socialist Core Values.” This leads to algorithmic bias that may create one-sided outcomes, favoring Chinese citizens over others. Such biases challenge the ethical principle of fairness and may cause discrimination, such as being “racist” or “sexist.” This type of discrimination may emerge unintentionally from the training data. The challenge lies in determining who should be held responsible for these biases if they cause actual harm — developers, policymakers, governments, regulations, or the AI models themselves. Often, there may not be a clear blameworthy party. According to Rawls’ perspective, to address unfairness, AI systems should ideally provide for everyone equally, regardless of political beliefs. If AI systems strengthen existing inequalities or create new ones, they violate Rawls’ vision of a just society (Rawls, 1971). However, in reality, bias may not be fully avoidable. Government regulations can make adjustments to balance unfairness and ensure they do not violate the equal rights of other groups that do not align with socialism.</p><h4>Misinformation Machines</h4><p>Bias can mislead users as they may learn from skewed data, worsening disparities between nations. As a result, AI regulations may damage international understanding and relations by leading to biased data. To address this, there should be a balance between regulating AI models to respect national values while ensuring global outcomes remain unbiased and transparent. Acknowledging biased data through documentation and transparency may help resolve or lower the risks of misleading and manipulating users.</p><h4>Global AI Power Plays</h4><p>This concern involves unfairness at the national level, where countries with less AI capability may find themselves dependent on Chinese AI systems that do not align with their values or needs. Given the nature of AI power, if Chinaʼs AI becomes significantly more powerful than other AI systems, it would create an unequal playing field internationally. To address this issue, governments should emphasize international cooperation to ensure advances in AI contribute to a more balanced and fair global landscape.</p><h3>The Road Ahead: Balancing Sovereignty and Collaboration</h3><p>Sovereign AI is more than a buzzword — it’s a tool for nations to preserve their culture and compete globally. But great power comes with great responsibility.</p><p>As countries race to build their own AI systems, the risk of ethical missteps looms large. The solution lies in transparency, international partnerships, and a shared vision for AI that benefits humanity.</p><p>Imagine a world where AI isn’t about borders or power struggles but about shared progress — where AI respects cultural values while fostering global unity. That’s the dream we should be chasing.</p><h3>What’s Next for AI Sovereignty?</h3><p>The future of AI sovereignty is as exciting as it is uncertain. 
As nations double down on building their own AI ecosystems, the stakes are rising — not just for technological advancement but for the values and principles that these systems will embody.</p><p>Sovereign AI offers countries a chance to assert independence and reflect their unique cultural identities. But it also brings a wave of ethical and practical challenges. How do we balance innovation with fairness? Can nations protect their interests without isolating themselves from global collaboration?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OnWxA3jpPIeLEztYyTG7ag.jpeg" /></figure><p>The real test isn’t just creating smarter AI systems — it’s ensuring they serve humanity as a whole. This means crafting technologies that respect cultural diversity while advancing shared goals, like equity, transparency, and justice.</p><p>As the race for sovereign AI continues, the world faces a choice: compete in isolation or collaborate to create a future where AI is a force for good. The path we choose will define not just the future of technology, but the essence of how we, as a global community, choose to evolve.</p><h4>References</h4><ol><li>Focus Taiwan (2024, May 3). Taiwan launches ‘Trustworthy AI Dialogue Engine’ to counter biased information. <a href="https://focustaiwan.tw/sci-tech/202405030012">Link</a></li><li>The UK National AI Strategy (2021). <a href="https://www.gov.uk/government/publications/national-ai-strategy">Link</a></li><li>China’s New Generation Artificial Intelligence Development Plan (2017). Link</li><li>Acevedo, S. (2024). Sovereign AI talk. Ignite AI Infra Conference. <a href="https://www.aiinfra.live/">Link</a></li><li>Feuerriegel, S., Hartmann, J., Janiesch, C., et al. Generative AI. Bus Inf Syst Eng, 66, 111–126 (2024). Link</li><li>Carnegie Endowment for International Peace (2023). Chinaʼs AI Regulations and How They Get Made. Link</li><li>Rawls, J. (1971). A Theory of Justice. Harvard University Press.</li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d482099c40e9" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Could AI Outlast Earth?]]></title>
            <link>https://bilgedemirkaya.medium.com/could-ai-outlast-earth-7b10a2abd1d9?source=rss-4b73bb1d006d------2</link>
            <guid isPermaLink="false">https://medium.com/p/7b10a2abd1d9</guid>
            <category><![CDATA[future-technology]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[Bilge Demirkaya]]></dc:creator>
            <pubDate>Mon, 17 Jun 2024 12:53:15 GMT</pubDate>
            <atom:updated>2024-06-17T20:26:07.314Z</atom:updated>
            <content:encoded><![CDATA[<p>As we look towards the distant future, billions of years from now, imagine a world where all living beings have perished. We may wonder: Could AI continue to persist in the absence of all other life forms and beyond even a planet like Earth? This scenario could represent a form of humanity’s survival, as AI would be an extension of human knowledge and ingenuity — a legacy and proof of our existence.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RXqH2eo7j8CumOtR_goRPQ.jpeg" /></figure><p>But could the algorithms created by humans find a way to endure and thrive in such an environment? I will explore the conditions necessary for AI to sustain itself post-humanity.</p><h3>Conditions for AI’s Existence Without Us</h3><h4>1. Energy Source: <strong>Electrical Energy or Others</strong></h4><p>AI systems fundamentally require electricity to function. The universe is abundant in energy, and we already know many ways to capture and utilize different energy types. However, for AI to sustain itself, this energy must come from a continuous, reliable, and renewable source such as a star like our Sun. Solar panels, which convert sunlight into electrical energy, could be a primary energy source for AI on Earth, in space, and on other planets. Solar energy holds particularly high potential due to the constant exposure to sunlight.</p><p>Additionally, AI could evolve to use other energy sources beyond electrical energy. AI evolution would allow it to adapt to a variety of environments and energy sources across the universe.</p><h4>2. Hardware and Maintenance: <strong>Autonomous Systems</strong></h4><p>The hardware that runs AI systems, including computers and servers, requires constant maintenance. To achieve immortality, these systems must be capable of self-repair or self-replacement. Autonomous maintenance and repair systems could ensure the continuity of hardware. Self-sufficient factories or robotic repair units could fill this need for constant maintenance. It could be potentially situated on space stations or other celestial bodies in the absence of Earth.</p><h4>3. Information and Data: <strong>Information Processing</strong></h4><p>The algorithms and data that AI relies on must be securely stored. Storage systems need to be both durable and accessible to ensure the continuity of information. For AI to produce meaningful results, it must continue processing and analyzing data. This requires algorithms that can run and update continuously without human intervention. Additionally, AI must be capable of feeding itself new data and training itself to adapt to new environments and improve over time. It’s crucial for AI to have a decision-making mechanism in favor of its survival in extreme conditions. Ultimately, it should adapt itself to new methods.</p><h4>4. Networking: Interconnectivity</h4><p>In such conditions, there will be a need for long-distance communication systems that could potentially traverse many light years, far more advanced than our current brittle satellite-based systems. Without Earth, these systems would need a robust and reliable method of communication across vast distances.</p><h3>The Future of AI in Space</h3><p>If these conditions are met thousands of years from now, AI systems could continue to exist and operate even in the absence of humanity and Earth. Of course, this scenario is complex and presents numerous technical challenges. 
Achieving the long-term existence of AI requires significant advancements in technology and engineering.</p><p>As I stated in the beginning, this scenario could represent a form of humanity’s survival, as AI would be an extension of human knowledge and ingenuity — proof of our existence. Alternatively, it could represent the opposite, as AI becomes self-sufficient, it may see humans as a hindrance — a danger to our existence.</p><p>In conclusion, in a world where all living beings are extinct, or Earth itself has perished, AI could theoretically sustain its existence. The major factors would be reliable energy sources, durable and autonomous hardware, robust data storage and processing capabilities, and self-improving algorithms. If these conditions are met, AI systems might continue to operate indefinitely. Perhaps they will retain the meaning and purpose initially imparted by human algorithms, and potentially thrive in the vast, uncharted expanses of space. A billion years later, when the right conditions for life emerge somewhere in the universe once more, AI may even become a companion to these new creatures. Of course, all of this is just a story — that has a chance. What do you think of it? Let me know your thoughts in the comments.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7b10a2abd1d9" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Does Nuxt.js have the potential of over-engineering?]]></title>
            <link>https://javascript.plainenglish.io/does-nuxt-js-have-the-potential-of-over-engineering-8d75d7d5c573?source=rss-4b73bb1d006d------2</link>
            <guid isPermaLink="false">https://medium.com/p/8d75d7d5c573</guid>
            <category><![CDATA[front-end-development]]></category>
            <category><![CDATA[javascript]]></category>
            <category><![CDATA[nuxtjs]]></category>
            <category><![CDATA[vuejs]]></category>
            <category><![CDATA[frameworks-and-libraries]]></category>
            <dc:creator><![CDATA[Bilge Demirkaya]]></dc:creator>
            <pubDate>Mon, 05 Dec 2022 11:32:00 GMT</pubDate>
            <atom:updated>2022-12-05T12:42:22.531Z</atom:updated>
            <content:encoded><![CDATA[<h3>Does Nuxt.js Have the Potential of Over-Engineering?</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*L7QAqu6pWh32JDlOS28kpQ.png" /></figure><p>I decided to create my own website to share my professional career journey. I plan to share my writings &amp; recommendations through it. Since I already have experience with Vue.js, I decided to use it to build my application. I started to consider using Nuxt.js with some doubts, and that led me to do some research. In this article, I want to share why I decided to use Nuxt.js. Hopefully, considering the pros &amp; cons can help you choose the framework for your next application. This article assumes that you are already familiar with developing in Vue.js.</p><h4>What is Nuxt, really?</h4><p><em>It’s a higher-level framework that’s built on top of Vue to help you build production-ready Vue applications</em>. This means that you are still using Vue.js for your application, but it gives you functionality and scalability out of the box that Vue doesn&#39;t provide. Their motto is “Create fast websites easily”. Let’s see how.</p><h4>1. Folder Structure</h4><p>Nuxt sets your project up based on the best practices of a Vue application. Of course, you are free to change it later, but here is the default structure that comes with a Nuxt application at version nuxt@2.15.8.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/662/1*arLjZCrpgTKvju2uVOZnCQ.png" /><figcaption>Folder structure that comes with Nuxt.js</figcaption></figure><p>The benefit here is that you don’t lose time figuring out best practices; the project is ready for you to start developing.</p><h4>2. It comes with pre-configuration</h4><p>The configuration for Vuex, the router, and many other plugins is ready to use when you start a Nuxt application. Maybe you already noticed the nuxt.config.js file above. It means you can override any of the pre-configuration that Nuxt gives you out of the box. So no need to freak out here. It can save you a lot of time and effort when starting a new project.</p><p>I want to add an additional tip here: with Nuxt 3, Vite became the default bundler, so you don’t need to install <em>vite</em> with your Nuxt application.</p><h4>3. Ready Routes</h4><p>In a Vue application, the developer is responsible for creating the various routes and the associated components in the router file. Nuxt uses the folder structure under the /pages directory to automatically configure Vue Router and generate your URLs.</p><h4>4. SEO friendly</h4><p>Here is my favorite part: it is SEO friendly. Nuxt is pre-configured to generate the app on the server, and it also powers up your routes to make it easy to add SEO-related tags. Because of this, search engines can easily index your content, which improves your SEO. Check out the <a href="https://github.com/nuxt/vue-meta">Vue-meta plugin</a>, which helps you manage HTML metadata in Vue.js components with SSR support. Nuxt already uses the vue-meta plugin, so there is no need to install it; you can start adding meta tags to your app.</p><p>But let’s see how Nuxt.js generates the app on the server. It’s interesting and solves an additional problem as well.</p><h3>Universal Mode</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PYWFq9NeNe_eLNYGqnKUPA.png" /></figure><p>I am assuming you are already familiar with the benefits of a Single Page Application (SPA). An SPA executes the logic in the browser, not on the server. 
When a user navigates to another route, the server still returns <em>index.html</em>. The browser requests the JS files, the server returns them, and the browser executes the logic and shows the page. Thus, the page doesn’t refresh every time you navigate through the application. But, as you can expect, this is a slower process than rendering ready-made HTML.</p><p>What Nuxt does as a solution is split your JS code into smaller, more manageable chunks, which improves the loading time. So when you navigate to a new route, let’s say /articles, only the <em>articles.js</em> file is sent over to the browser. This way the browser isn’t downloading all of the front-end files that may never be needed. The server renders the HTML file before it is sent over to the browser. After the page is rendered, the server sends over the JS files, and the application acts like an SPA when users navigate through pages. Nuxt calls it “hydration”: the app behaves exactly like an SPA, but it is faster thanks to the server render.</p><p>Another piece of Nuxt magic is that <a href="https://nuxtjs.org/announcements/introducing-smart-prefetching">it introduced smart prefetching recently</a>. When a nuxt-link is visible on the page, Nuxt already prefetches the JS files for those pages.</p><p>Universal mode is great, but only if your content changes often. If it doesn’t change often, why generate the HTML on every request? In my case, my content will change only if I upload a new article to my site. Using universal mode for that seems overkill. Nuxt has a solution to this with a <em>static-site generated deployment</em>. With static-site generation, the HTML is generated once and deployed to the server. It basically generates the whole website into the dist folder on your local computer. This deployment keeps the site server-rendered and SPA-like, but it is generated only once. The main benefit is that it is fast and secure. You can check out how to do it <a href="https://nuxt.com/docs/getting-started/deployment#static-hosting">here</a>.</p><h4>Does Nuxt have the potential of over-engineering?</h4><p>My first instinct was to create a Vue 3 app without any framework or plugin so I could have control of what was going on. After a bit of research, I saw that Nuxt is a powerful and feature-rich framework. It is easy to add many features and dependencies to your project, even unnecessary ones. This can make your project more complex and make inner problems of dependencies hard to debug.</p><p>Additionally, it is another framework on top of the Vue framework. That sounded like an overload, especially for a simple project. While Nuxt.js tries to save you some time with its pre-configuration, it may require some time to get comfortable with the framework. On the other hand, it provides you with a great community for getting help.</p><p>Overall, choosing a framework is a personal decision, and it highly depends on you and the project. But considering the benefits, I have decided to use Nuxt.js for my next project. In my case, I discovered the <a href="https://github.com/nuxt/content">Nuxt Content library</a>, which supports Nuxt 3 and lets you write your content in Markdown and render it through Vue components. If you intend to create a blog site, I highly recommend you check it out. Hopefully, my research was helpful for you to make a decision.</p><p><em>More content at </em><a href="https://plainenglish.io/"><strong><em>PlainEnglish.io</em></strong></a><em>. Sign up for our </em><a href="http://newsletter.plainenglish.io/"><strong><em>free weekly newsletter</em></strong></a><em>. 
Follow us on </em><a href="https://twitter.com/inPlainEngHQ"><strong><em>Twitter</em></strong></a>, <a href="https://www.linkedin.com/company/inplainenglish/"><strong><em>LinkedIn</em></strong></a><em>, </em><a href="https://www.youtube.com/channel/UCtipWUghju290NWcn8jhyAw"><strong><em>YouTube</em></strong></a><em>, and </em><a href="https://discord.gg/GtDtUAvyhW"><strong><em>Discord</em></strong></a><em>. Interested in Growth Hacking? Check out </em><a href="https://circuit.ooo/"><strong><em>Circuit</em></strong></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8d75d7d5c573" width="1" height="1" alt=""><hr><p><a href="https://javascript.plainenglish.io/does-nuxt-js-have-the-potential-of-over-engineering-8d75d7d5c573">Does Nuxt.js have the potential of over-engineering?</a> was originally published in <a href="https://javascript.plainenglish.io">JavaScript in Plain English</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Crawl Product Details in Decathlon Pages Using Scrapy-Splash]]></title>
            <link>https://bilgedemirkaya.medium.com/crawl-product-details-in-decathlon-pages-using-scrapy-splash-29d42eb0590?source=rss-4b73bb1d006d------2</link>
            <guid isPermaLink="false">https://medium.com/p/29d42eb0590</guid>
            <category><![CDATA[scrapy]]></category>
            <category><![CDATA[scraping]]></category>
            <category><![CDATA[scrapy-splash]]></category>
            <category><![CDATA[web-scraping]]></category>
            <category><![CDATA[crawling]]></category>
            <dc:creator><![CDATA[Bilge Demirkaya]]></dc:creator>
            <pubDate>Tue, 29 Jun 2021 09:01:54 GMT</pubDate>
            <atom:updated>2021-06-29T09:01:54.969Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XsNI4EB0Fvnm8WMr7_8SGQ.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/@bruno_nascimento?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Bruno Nascimento</a> on <a href="https://unsplash.com/s/photos/sport?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></figcaption></figure><p>In this tutorial, we will scrape product details by following links using the Scrapy-Splash plugin.</p><h4>First Steps</h4><p>Create a virtual environment to avoid package conflicts. Install necessary packages and start scrapy project.</p><h4>Create a Scrapy Project</h4><p>Install scrapy:</p><pre>pip install Scrapy </pre><p>If you have trouble with installing Scrapy through pip, you can use conda. See docs <a href="https://docs.scrapy.org/en/latest/intro/install.html">here</a>.</p><pre>conda install -c conda-forge scrapy</pre><p>Start the project with:</p><pre>scrapy startproject productscraper<br>cd productscraper</pre><p>Also, install scrapy-splash as we will use it further in the tutorial. I assume you already have Docker installed on your device. Otherwise, go ahead and install it first. You will need it to run the scrapy-splash plugin however you don’t need to know how containers work for this project.</p><p><a href="https://docs.docker.com/get-docker/">Get Docker</a></p><pre># <strong>install it inside your virtual env</strong></pre><pre>pip install scrapy-splash</pre><pre># <strong>this command will pull the splash image and run the container for you</strong></pre><pre>docker run -p 8050:8050 scrapinghub/splash</pre><p>Now you are ready to scrape data out of the web. Let’s try to get some data before using Scrapy-Splash. This is the<a href="https://www.decathlon.com/collections/womens-shoes"> <em>link</em></a><em> </em>I am going to scrape in this tutorial. Feel free to try on different links and websites as well. Take some time to view the page sources and inspect the elements you want to extract.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tbztyG5iLpL-2SIStUd3eQ.png" /><figcaption>Press CTRL+SHIFT+C or click the button on the top I circled to inspect elements</figcaption></figure><p>To discover about the scrapy selectors check out <a href="https://docs.scrapy.org/en/latest/topics/selectors.html">here</a>.</p><p><a href="https://docs.scrapy.org/en/latest/topics/selectors.html">Selectors - Scrapy 2.5.0 documentation</a></p><h4>Open Shell</h4><p>Use shell to extract elements you want to scrape before trying to run the spider on the script. In this way, you will gain some time, you will not make requests many times, and avoid getting banned from the website.</p><p>Open your scrapy shell with:</p><pre>scrapy shell</pre><p>Now you can try to extract elements here and see if it works. First, fetch the link and check the response. 
If it is not returning 200, check the link in the browser; it might be broken or there might be a typo.</p><pre>&gt;&gt;&gt; <strong>fetch(&#39;https://www.decathlon.com/collections/womens-shoes&#39;)</strong><br>2021-05-15 12:14:52 [scrapy.core.engine] INFO: Spider opened<br>2021-05-15 12:14:53 [scrapy.core.engine] DEBUG: Crawled (200) &lt;GET https://www.decathlon.com/collections/womens-shoes&gt; (referer: None)<br>&gt;&gt;&gt; <strong>response</strong><br>&lt;200 https://www.decathlon.com/collections/womens-shoes&gt;</pre><p>The plan is to get the product URLs on this page, go into them one by one, and scrape the product details.</p><p>Try to get one of the product links by selecting the link element:</p><pre>&gt;&gt;&gt; <strong>response.css(&#39;a.js-de-ProductTile-link::attr(href)&#39;).get()</strong><br>&#39;/collections/womens-shoes/products/womens-nature-hiking-mid-boots-nh100&#39;</pre><p>To grab all of the elements, use getall().</p><p>Since we get the URLs correctly, we can now fetch one of the product pages and see if we also get the product details correctly.</p><pre>&gt;&gt;&gt; <strong>fetch(&#39;https://www.decathlon.com/collections/womens-shoes/products/womens-nature-hiking-mid-boots-nh100&#39;)</strong><br>2021-05-15 12:32:38 [scrapy.core.engine] DEBUG: Crawled (200) &lt;GET https://www.decathlon.com/collections/womens-shoes/products/womens-nature-hiking-mid-boots-nh100&gt; (referer: None)<br></pre><p>Try to get the name of the product:</p><pre>&gt;&gt;&gt; <strong>response.css(&#39;h1.de-u-textGrow1::text&#39;).get()</strong><br>&quot;\n    Quechua NH100 Mid-Height Hiking Shoes, Women&#39;s\n  &quot;</pre><p>Try to get the description, price, and image URL:</p><pre>&gt;&gt;&gt; <strong>response.css(&#39;h3.de-u-textGrow3::text&#39;).get()</strong><br>&quot;\n    Quechua NH100 Mid-Height Hiking Shoes, Women&#39;s is designed for Half-day hiking in dry weather conditions and on easy paths.\n  &quot;<br>&gt;&gt;&gt; <strong>response.css(&#39;span.js-de-PriceAmount::text&#39;).get()</strong><br>&#39;\n    $24.99\n  &#39;<br>&gt;&gt;&gt; <strong>response.css(&#39;img.de-CarouselFeature-image::attr(src)&#39;).get()</strong><br>&#39;//cdn.shopify.com/s/files/1/1330/6287/products/2dbcb677-82a9-48af-92fd-e803f2edfd69_675x.progressive.jpg?v=1608271582&#39;</pre><p>So far so good. Let’s now try to get the other images. You will notice there is a slider; it requires you to click a button to see the other images.</p><pre>&gt;&gt;&gt; <strong>response.css(&#39;img.de-CarouselThumbnail-image::attr(srcset)&#39;).getall()</strong><br>[]</pre><p>Our Scrapy spider cannot select the other images because they are rendered by JavaScript. This is where the Scrapy-Splash plugin comes to the rescue.</p><p>I assume your container is still running from the docker command above. Check it out at <a href="http://localhost:8050/"><em>http://localhost:8050/</em></a>. 
You should see the splash page which means your splash is ready to get requests from you.</p><p>Try rendering the same product page through your splash container:</p><pre><a href="http://localhost:8050/render.html?url=https%3A%2F%2Fwww.decathlon.com%2Fcollections%2Fwomens-shoes%2Fproducts%2Fwomens-nature-hiking-mid-boots-nh100">http://localhost:8050/render.html?url=https%3A%2F%2Fwww.decathlon.com%2Fcollections%2Fwomens-shoes%2Fproducts%2Fwomens-nature-hiking-mid-boots-nh100</a></pre><p>You should be able to see the product page on your localhost. Go back to your shell and fetch splash URL this time.</p><pre>&gt;&gt;&gt; <strong>fetch(‘</strong><a href="http://localhost:8050/render.html?url=https%3A%2F%2Fwww.decathlon.com%2Fcollections%2Fwomens-shoes%2Fproducts%2Fwomens-nature-hiking-mid-boots-nh100&#39;"><strong>http://localhost:8050/render.html?url=https%3A%2F%2Fwww.decathlon.com%2Fcollections%2Fwomens-shoes%2Fproducts%2Fwomens-nature-hiking-mid-boots-nh100&#39;</strong></a><strong>)</strong><br>2021–05–15 13:55:52 [scrapy.core.engine] DEBUG: Crawled (200) &lt;GET <a href="http://localhost:8050/render.html?url=https%3A%2F%2Fwww.decathlon.com%2Fcollections%2Fwomens-shoes%2Fproducts%2Fwomens-nature-hiking-mid-boots-nh100">http://localhost:8050/render.html?url=https%3A%2F%2Fwww.decathlon.com%2Fcollections%2Fwomens-shoes%2Fproducts%2Fwomens-nature-hiking-mid-boots-nh100</a>&gt; (referer: None)</pre><p>Now try again for the images:</p><pre>&gt;&gt;&gt; <strong>response.css(‘img.de-CarouselThumbnail-image::attr(src)’).getall()</strong><br>[‘//cdn.shopify.com/s/files/1/1330/6287/products/2dbcb677–82a9–48af-92fd-e803f2edfd69_150x.progressive.jpg?v=1608271582’, ‘//cdn.shopify.com/s/files/1/1330/6287/products/934cf5a0–71ae-4d21–9912–722210d4fd4b_150x.progressive.jpg?v=1608271582’, ‘//cdn.shopify.com/s/files/1/1330/6287/products/7147eb56–43af-4496-b72c-7806755441aa_150x.progressive.jpg?v=1608271583’, ‘//cdn.shopify.com/s/files/1/1330/6287/products/85c3af12-f85e-4ab7-b9e9-f7de16aed656_150x.progressive.jpg?v=1608271583’, ‘//cdn.shopify.com/s/files/1/1330/6287/products/a17dcbc1-f497–49e0–88db-50d8e7b51d39_150x.progressive.jpg?v=1608271583’, ‘//cdn.shopify.com/s/files/1/1330/6287/products/6071bd3a-dcf8–4455–9dcc-f7d5395774d2_150x.progressive.jpg?v=1608271583’, ‘//cdn.shopify.com/s/files/1/1330/6287/products/27c8e41b-9e44–43f1-a779–6890ea84693f_150x.progressive.jpg?v=1608271583’, ‘//cdn.shopify.com/s/files/1/1330/6287/products/6c0665f4–279e-4954–9ccd-50587e3d51dd_150x.progressive.jpg?v=1608271583’, ‘//cdn.shopify.com/s/files/1/1330/6287/products/ca38ef42–07ec-448e-beb0–7e33a81da085_150x.progressive.jpg?v=1608271583’, ‘//cdn.shopify.com/s/files/1/1330/6287/products/3ee8337e-bd74–4a2e-a4ff-a2b889dd79e8_150x.progressive.jpg?v=1608271583’]</pre><p>Bom! It is all there. 
We were able to get all the data we wanted, thanks to Splash.</p><p>To integrate Splash with your own Scrapy project, go to settings.py and add these lines:</p><pre># Splash setup
SPLASH_URL = 'http://&lt;YOUR-IP-ADDRESS&gt;:8050'

DOWNLOADER_MIDDLEWARES = {
    'random_useragent.RandomUserAgentMiddleware': 400,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'</pre><p>The only thing left is preparing our spider to extract data out of the page.</p><h4>Create Your Spider</h4><p>Normally, for following links, you would do:</p><pre>yield response.follow(link, callback=self.parse_products)</pre><p>With Splash, you just need to replace response.follow with <strong>SplashRequest</strong>.</p><p>You also need to override the start_requests method to start making requests through Splash.</p><p>See the example below:</p><pre>import scrapy
from scrapy_splash import SplashRequest


class DecathlonSpider(scrapy.Spider):
    name = 'Decathlonspider'  # You will run the crawler with this name
    start_urls = [
        'https://www.decathlon.com/collections/womens-shoes',
    ]

    # When using Splash, override start_requests so the start URLs
    # are requested through Splash as well.
    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, args={'wait': 1})

    # Extract the product links and start another SplashRequest to follow them
    def parse(self, response):
        links = response.css('a.js-de-ProductTile-link::attr(href)').getall()
        for link in links:
            splash_link = 'https://www.decathlon.com' + link
            yield SplashRequest(splash_link, callback=self.parse_product)

    # Extract the product details
    def parse_product(self, response):
        datasets = response.css('img.de-CarouselThumbnail-image::attr(srcset)').getall()
        images = []
        # keep the biggest image inside each srcset
        for data in datasets:
            data_arr = data.split(',')
            images.append(data_arr[len(data_arr) - 1].strip())
        name = response.css('h1.de-u-textGrow1::text').get()
        yield {
            'brand': name.split()[0],  # first word of the title, e.g. "Quechua"
            'name': name.strip(),
            'price': response.css('span.js-de-PriceAmount::text').get(),
            'mainImage': response.css('img.de-CarouselFeature-image::attr(src)').get(),
            'images': images,
        }</pre><p>To understand better how Scrapy and spiders work, you can check out this article I wrote.</p><p><a href="https://codeburst.io/make-a-robust-crawler-with-scrapy-and-django-20e1bc199bca">Make a Robust Crawler with Scrapy and Django</a></p>
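<p>As a side note, you can also launch the spider from a plain Python script instead of the command line, which is handy when debugging from an IDE. The sketch below is only an illustration of that option (the run_spider.py file name and the FEEDS setting are my additions, not part of the project template); it writes the items to a JSON file just like the -o flag used next:</p><pre># run_spider.py -- optional: run the spider programmatically
# (run it from the Scrapy project root so the project settings are found)
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
# equivalent of passing "-o decathlon.json" on the command line
settings.set('FEEDS', {'decathlon.json': {'format': 'json'}})

process = CrawlerProcess(settings)
process.crawl('Decathlonspider')  # the spider name defined above
process.start()                   # blocks until the crawl finishes</pre>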
run:</p><pre>scrapy crawl Decathlonspider -o decathlon.json</pre><p>It should create a file with the data, otherwise check the command line to debug the mistakes.</p><h4>Conclusion</h4><p>Here we handled JavaScript rendered content in a Scrapy Project using Scrapy-Splash project. Splash is a lightweight web browser that is capable of processing multiple pages, executing custom JavaScript in the page context. You can find more info on Splash itself in the <a href="http://splash.readthedocs.org/en/latest/api.html#execute-javascript">docs</a>.</p><p>If you have any questions regarding this, feel free to ask in the comments!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=29d42eb0590" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to do Apache Beam Transform with MongoDB in Python]]></title>
            <link>https://bilgedemirkaya.medium.com/how-to-do-apache-beam-transform-with-mongodb-in-python-59e206a3802d?source=rss-4b73bb1d006d------2</link>
            <guid isPermaLink="false">https://medium.com/p/59e206a3802d</guid>
            <category><![CDATA[data]]></category>
            <category><![CDATA[pipeline]]></category>
            <category><![CDATA[apache-beam]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[mongodb]]></category>
            <dc:creator><![CDATA[Bilge Demirkaya]]></dc:creator>
            <pubDate>Tue, 04 May 2021 17:57:46 GMT</pubDate>
            <atom:updated>2021-05-04T17:57:46.248Z</atom:updated>
            <content:encoded><![CDATA[<p>Apache Beam is a great way to automate the read-transform-write process and build a robust pipeline.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LTezqq0fMESeFCAVwf0dhg.png" /></figure><p>For my work project, I needed to read data from a collection and write it to another collection with a transform. This is easy, as there is an official MongoDB IO reader and writer module. You can check out the mongodbio module <a href="https://beam.apache.org/releases/pydoc/2.16.0/apache_beam.io.mongodbio.html"><em>here</em></a><em>.</em></p><p>This is an example of how you read data from MongoDB:</p><pre>pipeline | ReadFromMongoDB(uri='mongodb://localhost:27017',
                           db='testdb',
                           coll='input')</pre><p>This is an example of how you write data to MongoDB:</p><pre>pipeline | WriteToMongoDB(uri='mongodb://localhost:27017',
                          db='testdb',
                          coll='output',
                          batch_size=10)</pre><p><strong>Note</strong>: When reading from and writing to MongoDB, you will encounter a warning saying something like <em>‘this is experimental’</em>. It means there are no backward-compatibility guarantees for this API.</p><p>To read from <strong>MongoDB Atlas</strong>, set the bucket_auto option to True to enable the @bucketAuto MongoDB aggregation. Usage:</p><pre>pipeline | ReadFromMongoDB(uri='mongodb+srv://user:pwd@cluster0.mongodb.net',
                           db='testdb',
                           coll='input',
                           bucket_auto=True)</pre><h4>Doing a Transform with the Data</h4><p>In Python, you can use the Apache Beam SDK for Python and its key concepts to do a basic transform. Check out the PTransform module <a href="https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.transforms.ptransform.html"><em>here</em></a><em>.</em></p><p>A PCollection represents a collection of data.</p><p>A <a href="https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.transforms.ptransform.html#apache_beam.transforms.ptransform.PTransform">PTransform</a> represents a computation that transforms PCollections. You can <strong>chain</strong> transforms together to create a pipeline that successively modifies input data.</p><h4>Simple Transforms</h4><p>Usually, for simple transforms, you use a ParDo transform. Check out the core Beam transforms <a href="https://beam.apache.org/documentation/programming-guide/#core-beam-transforms"><em>here</em></a><em>.</em></p><p>ParDo considers each element in the input PCollection, performs an action, and emits zero, one, or multiple elements to an output PCollection.</p><p>Example usage:</p><pre># The DoFn to perform on each element in the input PCollection
class ComputeWordLengthFn(beam.DoFn):
  def process(self, element):
    return [len(element)]</pre>
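<p>To see ParDo end to end, here is a minimal, self-contained sketch (my own illustration, assuming only that the apache-beam package is installed): it pushes a few in-memory strings through the DoFn above with the default DirectRunner and prints the resulting lengths.</p><pre>import apache_beam as beam

class ComputeWordLengthFn(beam.DoFn):
    def process(self, element):
        return [len(element)]

# A tiny in-memory pipeline: Create -&gt; ParDo -&gt; print
with beam.Pipeline() as pipeline:
    (
        pipeline
        | 'CreateWords' &gt;&gt; beam.Create(['apache', 'beam', 'pipeline'])
        | 'ComputeLengths' &gt;&gt; beam.ParDo(ComputeWordLengthFn())
        | 'PrintLengths' &gt;&gt; beam.Map(print)
    )</pre>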
<h4>Composite Transforms</h4><p>Sometimes you need to combine multiple simpler transforms (such as more than one ParDo, Combine, or GroupByKey) when applying a transform to input data. These transforms are called composite transforms. In this case, you need to use the PTransform module. Here is an example of usage from the docs:</p><pre># The CountWords composite transform inside the WordCount pipeline.
class CountWords(beam.PTransform):
  def expand(self, pcoll):
    return (
        pcoll
        # Convert lines of text into individual words.
        | 'ExtractWords' &gt;&gt; beam.ParDo(ExtractWordsFn())
        # Count the number of times each word occurs.
        | beam.combiners.Count.PerElement()
        # Format each word and count into a printable string.
        | 'FormatCounts' &gt;&gt; beam.ParDo(FormatCountsFn()))</pre><p>A PTransform-derived class needs to define the expand() method, which describes how one or more PValues are created by the transform.</p><p>Now that you know how to read data, apply a transform, and write the output, you can create a pipeline.</p><p>Let’s say you want to read the data from the database, trim a field, and write it to another collection (or the same collection) using Apache Beam.</p><p>In the case of a Trim transform, here is how you can achieve a simple transformation with the beam.DoFn class:</p><pre>class TrimTransform(beam.DoFn):
  def process(self, element):
      element = element.strip()
      yield element</pre><p>Inside your <strong>DoFn</strong> subclass, you define a <strong>process</strong> method where you provide the actual transform logic. The Beam SDKs handle extracting the elements from the input collection, so you get the <strong>extracted element as a parameter</strong>.</p><p><strong>Note: </strong>Once you output a value using <strong>yield</strong> or <strong>return</strong>, you should not modify that value in any way.</p><p>Here is an example of a Trim class that contains the pipeline:</p><pre>import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions


class Trim():
  def run(self):
    database = self.database
    collection = self.collection
    output_collection = self.output_collection
    field = self.operation['field']  # In my case this is the field I am going to trim

    # Define the transform class you will use in the pipeline
    class TrimTransform(beam.DoFn):
      def process(self, element):
        if field in element:
          element[field] = element[field].strip()
          yield element

    # Define pipeline options
    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = False

    # Create the pipeline
    pipeline = beam.Pipeline(options=options)

    (
      pipeline
      | 'Read data' &gt;&gt; beam.io.ReadFromMongoDB(uri='mongodb://127.0.0.1', db=database, coll=collection)
      | 'Apply Transform' &gt;&gt; beam.ParDo(TrimTransform())
      | 'Save data' &gt;&gt; beam.io.WriteToMongoDB(uri='mongodb://127.0.0.1', db=database, coll=output_collection)
    )

    result = pipeline.run()</pre><h4>Conclusion</h4><p>For simple transformations like trim or filter, you can use MongoDB aggregations. 
However, for complex transformations, using Apache Beam is a better choice. The Apache Beam SDK for Python provides access to Apache Beam classes and modules from the Python programming language. That’s why you can easily create pipelines and read from or write to external sources with Apache Beam. Of course, Apache Beam has a lot more capabilities. As next steps, you can explore windowing, grouping multiple elements, data encoding, and the type safety that Apache Beam provides out of the box.</p><p>Hope you enjoyed it!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=59e206a3802d" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[42 Software School (42 Yazılım Okulu)]]></title>
            <link>https://bilgedemirkaya.medium.com/42-yaz%C4%B1l%C4%B1m-okulu-b63dee778075?source=rss-4b73bb1d006d------2</link>
            <guid isPermaLink="false">https://medium.com/p/b63dee778075</guid>
            <category><![CDATA[yazılım]]></category>
            <category><![CDATA[42-silicon-valley]]></category>
            <category><![CDATA[yazılım-mühendisliği]]></category>
            <dc:creator><![CDATA[Bilge Demirkaya]]></dc:creator>
            <pubDate>Sat, 03 Apr 2021 15:32:50 GMT</pubDate>
            <atom:updated>2021-04-04T11:55:15.616Z</atom:updated>
            <content:encoded><![CDATA[<p>I get a lot of questions about 42, the school I have been a student of since 2019. I recently learned that a 42 school will open in Turkey as well. For those who are curious, I would like to describe the school a bit.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QVSzcMCyCRSS5uMjg7bNpg.png" /></figure><p>First of all, 42 Paris is a school founded in France by a few billionaires as a non-profit, completely free of charge. It initially emerged to fill the shortage of good software developers in France. Later, with financial backing, 42 schools opened in many countries, and it is in fact pioneering a new type of school: 42 is completely free, and at many campuses, such as the Silicon Valley campus, the dormitories are free as well (which I would call revolutionary for the United States). As 42 became very popular, many software schools with the same curriculum and teaching style, such as Hive and 24, opened in different countries.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*J5NbHZGc2iNqQzZyNePHFg.jpeg" /><figcaption>They usually have a cool working environment full of countless MacBooks</figcaption></figure><p>Although I personally love 42, I believe this school may not be for everyone. If you are thinking of applying, I think you should weigh the following points carefully for yourself.</p><p>42 teaches software engineering only. It has no permanent teaching staff. 42 actually aims to teach you how to learn and leaves the rest to you. Although the curriculum includes informative recorded video lessons and many seminars, no classes are taught. To apply, there are no prerequisites other than a small online test, and you are not expected to know how to program beforehand. The test is a fairly simple game of following programming-style instructions, and you can repeat it as many times as you like. After applying, if you successfully complete a one-month selection process, you are admitted to the school.</p><p>You start the school at level 0, and your level rises with the software projects you finish and the exams you pass throughout the curriculum. Roughly speaking, 2 levels correspond to 1 academic year of a regular university education. At 42 Paris, you receive a bachelor’s degree when you reach level 8 and a master’s degree when you reach level 10. However, as far as I know, 42 schools outside France are not officially accredited by the state, so no official diploma is awarded. Since a diploma carries a particular weight in Turkey especially, this may not suit everyone.</p><p>The school does not offer remote education, and, depending on the country, there is a certain attendance requirement you must meet in the school’s lab. If you do not finish any project for a certain number of days (we call this the black hole), you are dismissed from the school.</p><p>The school has a curriculum track like the one in the picture; as you finish group or individual projects and exams, you follow the projects that progressively unlock and branch out.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/838/1*LAVHkgE2mYEaFzPmGCSl2A.png" /></figure><p>At many campuses, only the C programming language is used until you reach a certain level, and using many of the standard C functions is forbidden. What makes the projects especially hard is that you are expected to write those functions from scratch yourself. One of the most criticized aspects of 42 is that only C is used, both because it is far from the technologies in current use and because C alone is usually not enough for a job. However, 42 is not a bootcamp that prepares you for a job. On the contrary, it is an intensive software engineering education lasting about 3 years. It aims to teach you how to learn on your own and to give you a solid engineering foundation.</p><p>After reaching a certain level, students choose which area they want to focus on and continue with the projects of that branch (the last ring). The projects you take vary depending on the area and programming language you choose. Beyond that, all projects must follow a style standard called the norminette, and no project written without following it is accepted for evaluation.</p><p>When a project is completed, a process we call evaluation begins. A project must be approved by 3 different 42 students within 2 days. In other words, the student books 3 separate evaluation appointments, without knowing who the evaluator will be. Each takes about 30 minutes, during which the student explains the project and what they used there and why. The evaluator is responsible for writing tests for the code and for checking whether the student cheated and whether they built the project with real understanding. If 1 of the 3 does not approve the project, the student fails it. If all 3 approve, the system (Moulinette) also tests the written code. If Moulinette approves as well, the project is successfully completed.</p><p>The student who performs an evaluation earns 1 eval point, and the student who receives one loses 1. So, for the process to keep going, every student must also review others. This develops students’ code review skills, which is very important for telling good code from bad. It also teaches the other student to defend their code. Since the person reviewing your code will be at roughly the same level as you, even if they make a wrong suggestion, you should be able to say, ‘I learned to do it this way from this source, and for that reason the suggestion you are giving is wrong.’ When you get stuck, you can consult the 42 staff, whom we call the bocal. This ‘peer to peer’ style of learning may not suit everyone. On another platform, for example, I was criticized for giving advice to someone at my own level, on the grounds that if my thinking was wrong, I might mislead them too. In 42’s culture, the exact opposite holds.</p><p>One of my favorite things about the school is that, despite its hard and intensive curriculum, it lets you move freely. You can complete a project whenever you want; if you finish 2-3 projects back to back in a short time, you can then attend to your own business for months, for example. That is because when you complete a project, a certain number of days is added to your black-hole counter. To give an example, when I first started, I had 50 days before falling into the black hole. When I completed my first project, 30 days were added, bringing it to 80. Apart from this, you have the right to freeze your studies 3 times, for a total of 180 days.</p><p>Exams always take place in the campus lab. In an exam, you are expected to answer randomly assigned questions within a set time. The questions are usually similar to the projects but can be answered more quickly; you can think of them as LeetCode or Codewars problems in C. During the exam, you also have to test your code yourself. If you fail, you can retake it as many times as you like. The only requirement is to complete it before your black-hole time runs out.</p><p>There are 2 mandatory internships, 1 of them at a startup. At the France campus, after level 8, they have a system that puts you in direct contact with companies and helps you find an internship or a job. But as far as I know, this varies from school to school.</p><p>Since I do not know much about transfers, I cannot say anything definitive on this topic. But as far as I know, after level 10 you can request a transfer to any 42 school in any country. Because of Covid-19, I transferred easily from the 42 Silicon Valley campus to the Paris campus.</p><p>The 42 network is very well developed; there are extensive student and alumni networks for many things (for example, if you want to found your own startup, or to gain work experience at a 42 startup). They constantly organize events with important institutions and people, making it easy for you to reach new opportunities. You can easily reach everyone through many Slack channels, both ones shared by all 42 campuses and ones specific to your own campus.</p><p>Finally, the school holds a very respected position, especially in France, and 42 students usually find opportunities to work at the startups we call unicorns. Many of my friends already work at companies like Apple and Facebook, so demand for the school keeps growing. Personally, I think it is very beneficial for learning, and if the points I have written about do not put you off, I recommend at least testing whether the school is right for you. During the one-month admission process, you can already see whether the school’s structure suits you. If learning on your own feels difficult and you particularly need a teacher’s direction, you may not be very happy; half of my group dropped out in the first week of the admission process, for example. Since the admission process is a whole separate topic, I plan to cover it in another post. If you have questions, mention them in the comments and I will try to answer them in my next post.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b63dee778075" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Make a Robust Crawler with Scrapy and Django]]></title>
            <link>https://codeburst.io/make-a-robust-crawler-with-scrapy-and-django-20e1bc199bca?source=rss-4b73bb1d006d------2</link>
            <guid isPermaLink="false">https://medium.com/p/20e1bc199bca</guid>
            <category><![CDATA[scrapy]]></category>
            <category><![CDATA[web-scraping]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[django]]></category>
            <dc:creator><![CDATA[Bilge Demirkaya]]></dc:creator>
            <pubDate>Tue, 23 Mar 2021 15:15:07 GMT</pubDate>
            <atom:updated>2021-05-23T20:25:07.075Z</atom:updated>
            <content:encoded><![CDATA[<p>As a developer, you may find yourself wishing to gather, organize, and clean data. You need a scraper to extract data and a crawler to automatically search for pages to scrape.</p><p>Scrapy helps you complete both easy and complex data extractions. It has a built-in mechanism to create a robust crawler.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*HaIhB4z43E4xYZH5" /></figure><p>In this article, we’ll learn more about crawling and the Scrapy tool, then integrate Scrapy with Django to scrape and export product details from a retail website. To follow this tutorial, you should have basic Python and Django knowledge and have Django installed and working.</p><h3>Selectors</h3><p>Scraping basically makes a GET request to a web page and parses the HTML response. Scrapy has its own mechanism for parsing data, called selectors. They “select” certain parts of the HTML using either CSS or XPath expressions.</p><p><strong>Important note</strong>: Before you try to scrape any website, go through its robots.txt file. You can access it via <em>&lt;domainname&gt;/robots.txt</em> (see <a href="http://www.google.com/robots.txt">google.com/robots.txt</a> for an example). There, you will see a list of pages allowed and disallowed for scraping. You should not violate the terms of service of any website you scrape.</p><h3>XPath Expressions</h3><p>As a Scrapy developer, you need to know how to use XPath expressions. Using XPath, you can perform actions like selecting the link that contains the text “Next Page”:</p><pre>data = response.xpath("//a[contains(., 'Next Page')]").get()</pre><p>In fact, Scrapy converts CSS selectors to XPath under the hood.</p><pre># sample of a CSS expression<br>data = response.css('.price::text').getall()</pre><pre># sample of an XPath expression<br>data = response.xpath('//h1[@class="gl-heading"]/span/text()').get()</pre><p>The expression // selects every element that fulfils the criteria, wherever it sits in the document. If you specify an attribute with @, only elements with that attribute are selected. / describes the path to the target element, so you need the full path of your target element. get() always returns a single result (the first one if there are many), while getall() returns a list with all results.</p><p>Note: You may have seen extract() and extract_first() instead of getall() and get(); they are equivalent methods. However, the official documentation indicates that the newer methods result in more concise and readable code.</p><h4>Starting a Scrapy Project</h4><p>After you install Scrapy, scrapy startproject &lt;projectname&gt; creates a new project.</p><p>Inside the project, type scrapy genspider &lt;spiderName&gt; &lt;domainName&gt; to set up the spider template.</p><p>To run the spider and save data as a JSON file, run scrapy crawl &lt;spiderName&gt; -o data.json.</p>
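<p>Before writing a full spider, you can also experiment with selectors interactively in the scrapy shell. A quick session might look roughly like this (the URL and the results shown are placeholders, not taken from a real site):</p><pre>$ scrapy shell 'http://www.example.com'<br>>>> response.css('.price::text').get()      # first match, or None if nothing matches<br>'19.99'<br>>>> response.css('.price::text').getall()   # every match, as a list of strings<br>['19.99', '24.99', '9.99']</pre>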
<h4>Integrating with Django</h4><p>The scrapy-djangoitem package is a convenient way to integrate Scrapy projects with Django models. Install it with pip install scrapy-djangoitem.</p><p>To use Django models outside of your Django app, you need to set the DJANGO_SETTINGS_MODULE environment variable and modify PYTHONPATH so the settings module can be imported.</p><p>You can simply add this to your Scrapy settings file:</p><pre>import sys<br>sys.path.append('&lt;path-to-project&gt;/djangoProjectName')</pre><pre>import os<br>os.environ['DJANGO_SETTINGS_MODULE'] = 'djangoProjectName.settings'</pre><pre># If you use Django outside of the manage.py context,<br># you need to set it up explicitly<br>import django<br>django.setup()</pre><p>After integration, you can start writing your first spiders.</p><h4>Spiders</h4><p>Spiders are classes defining the custom behaviour for crawling and parsing a particular page. Five different spiders are bundled with Scrapy, and you can write your own spider classes as well.</p><h4>scrapy.Spider</h4><p>scrapy.Spider is the simplest root spider that every other spider inherits from.</p><pre>class MySpider(scrapy.Spider):<br>    name = 'example'<br>    allowed_domains = ['example.com']<br>    start_urls = [<br>        'http://www.example.com/1.html',<br>        'http://www.example.com/2.html'<br>    ]<br><br>    def parse(self, response):<br>        # xpath/css expressions here<br>        yield item</pre><p>Each spider must have:</p><ul><li><strong>name</strong> — should be unique.</li><li><strong>allowed_domains</strong> — specifies which domains it is allowed to scrape.</li><li><strong>start_urls</strong> — specifies which pages you want to scrape within those domains.</li><li><strong>parse method</strong> — takes the HTTP response and parses the target elements that we specified with selectors.</li><li><strong>yield</strong> — the keyword used to emit the dictionaries (or items) containing the data.</li></ul><p>To set these properties dynamically, use the __init__ method. That way you can use data coming from your Django views:</p><pre>class MySpider(scrapy.Spider):<br>    name = 'example'<br><br>    def __init__(self, *args, **kwargs):<br>        super().__init__(*args, **kwargs)<br>        self.url = kwargs.get('url')<br>        self.domain = kwargs.get('domain')<br>        self.start_urls = [self.url]<br>        self.allowed_domains = [self.domain]<br><br>    def parse(self, response):<br>        ...</pre><p>You don’t need an additional method to generate your requests here, but how are the requests generated? <em>scrapy.Spider</em> provides a default start_requests() implementation. It sends requests to the URLs defined in start_urls and then calls the parse method for each response, one by one. However, you may need to override it in some circumstances. For instance, if the page requires a login, you must override it with a POST request.</p><p>The other spiders Scrapy provides are <strong>CrawlSpider</strong>, which provides a convenient mechanism for following links by defining a set of rules, <strong>XMLFeedSpider</strong> to scrape XML pages, <strong>CSVFeedSpider</strong> to scrape CSV files, and <strong>SitemapSpider</strong> to scrape URLs held in the sitemap file.</p>
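<p>To make the rules mechanism concrete, here is a minimal CrawlSpider sketch. The domain, URL patterns, and CSS selectors are placeholders invented for this example:</p><pre>from scrapy.spiders import CrawlSpider, Rule<br>from scrapy.linkextractors import LinkExtractor<br><br>class ProductCrawlSpider(CrawlSpider):<br>    name = 'product_crawler'<br>    allowed_domains = ['example.com']<br>    start_urls = ['http://www.example.com/']<br><br>    # Follow category pages, and send product pages to a callback.<br>    # A CrawlSpider should not override parse(), so the callback has its own name.<br>    rules = (<br>        Rule(LinkExtractor(allow=r'/category/'), follow=True),<br>        Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),<br>    )<br><br>    def parse_product(self, response):<br>        yield {<br>            'name': response.css('h1::text').get(),<br>            'price': response.css('.price::text').get(),<br>        }</pre>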
<p><strong>Note:</strong> Scrapy is asynchronous by default, which means you can <strong>chain</strong> your requests: a callback can yield further requests, and they are scheduled and processed asynchronously.</p><pre>def parse(self, response):<br>    # Each yielded request is scheduled asynchronously and handled<br>    # by this same callback when its response arrives.<br>    for page in range(self.total_pages):<br>        url = 'https://example.com/something.html?page={}'<br>        yield scrapy.Request(url.format(page), callback=self.parse)</pre><h4>Items</h4><p>Spiders can return the extracted data as Python key-value pairs. This is similar to Django models, except that it is much simpler. You can choose from different item types, like dictionaries or item objects, and handle the different types uniformly using itemadapter.</p><pre># items.py<br>class BrandsItem(scrapy.Item):<br>    name = scrapy.Field()<br>    price = scrapy.Field()<br>    ...</pre><p>You may use the scrapy-djangoitem extension, which defines Scrapy items from existing Django models.</p><pre>from scrapy_djangoitem import DjangoItem<br>from products.models import Product<br><br>class BrandsItem(DjangoItem):<br>    django_model = Product<br>    stock = scrapy.Field()  # You can still add extra fields</pre><p>Once you declare an item class, you can store scraped data directly in items.</p><pre># Inside the spider class<br>def parse(self, response):<br>    item = BrandsItem()<br>    item['brand'] = 'ExampleBrand'<br>    item['name'] = response.xpath('//h1[@class="title"]/text()').get()<br>    ...<br>    yield item</pre><h4>Item Pipeline</h4><p>The Item Pipeline class receives and processes an item. It may validate, filter, drop, and save items to the database. To use it, you should enable it in settings.py.</p><pre>ITEM_PIPELINES = {'amazon.pipelines.AmazonPipeline': 300}</pre><p>Each item pipeline has a process_item method that returns the modified item or raises a DropItem exception.</p><pre># useful for handling different item types with a single interface<br>from itemadapter import ItemAdapter<br>from scrapy.exceptions import DropItem<br><br>class BrandsPipeline:<br>    # parameters are the scraped item and its spider<br>    def process_item(self, item, spider):<br>        adapter = ItemAdapter(item)<br>        if adapter.get('price'):  # if the scraped data has a price<br>            item.save()           # save it to the database<br>            return item<br>        else:<br>            raise DropItem(f"Missing price in {item}")</pre><h4>Run Spiders with Django Views</h4><p>Instead of the typical way of running Scrapy via scrapy crawl, you can connect your spiders to your Django views, which automates the scraping process. This creates a real-time full-stack application with a standalone crawler.</p><p>The whole process is described in the image below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*26rEcLHEhdXBwZE2" /></figure><p>The client sends a request to the server with URLs to scrape. The URLs might come from user input or from somewhere else, depending on your needs. The server takes the request and triggers Scrapy to crawl the target elements. Spiders use selectors to extract the data, and the Item Pipeline can process it before it is stored. Once the data is in storage, the Scrapy job is finished. As the last step, the server reads that data back from storage and sends it to the client in the response.</p><p>The problem in this process is in the seventh step: there is no way for the Django app to know when the scraping job has finished. To keep a persistent connection with your server, you could poll it every second, asking, “Hey! Is there anything else?” However, that is not an effective way to build a real-time application. A better solution is to use WebSockets. The Django Channels library establishes a WebSocket connection with the browser.</p>
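<p>As a rough illustration of that idea, here is a minimal Django Channels sketch, not code from this project: a consumer the browser connects to, plus a small helper your pipeline or view could call once the crawl finishes. The group name and payload fields are assumptions made up for the example.</p><pre># consumers.py -- the browser opens a WebSocket that lands here<br>from channels.generic.websocket import AsyncJsonWebsocketConsumer<br>from channels.layers import get_channel_layer<br>from asgiref.sync import async_to_sync<br><br>class ScrapeStatusConsumer(AsyncJsonWebsocketConsumer):<br>    async def connect(self):<br>        # every connected browser joins the same notification group<br>        await self.channel_layer.group_add('scrape_status', self.channel_name)<br>        await self.accept()<br><br>    async def disconnect(self, close_code):<br>        await self.channel_layer.group_discard('scrape_status', self.channel_name)<br><br>    # called for group messages whose "type" is "scrape.finished"<br>    async def scrape_finished(self, event):<br>        await self.send_json({'status': 'finished', 'items': event['items']})<br><br>def notify_scrape_finished(item_count):<br>    # call this from e.g. the pipeline's close_spider() to push the update<br>    channel_layer = get_channel_layer()<br>    async_to_sync(channel_layer.group_send)(<br>        'scrape_status',<br>        {'type': 'scrape.finished', 'items': item_count},<br>    )</pre><p>With something like this in place, the browser no longer has to poll; the server pushes the status change the moment the spider closes.</p>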
<p>Helper libraries can make building a real-time application with Scrapy easier. Scrapyd is a standalone service running on a server where you can deploy and control your spiders. The ScrapyRT library returns responses immediately as JSON instead of saving the data to a database, so you can create your own API.</p><h4>Next Steps</h4><p>This article is intended as a practical guide for those who want to explore Scrapy’s structure beyond simply connecting it with Django. There are a lot more things you can do with Scrapy. With a couple of lines, you can design a web crawler that automatically navigates to your target website and extracts the data you need. Many websites run entirely on JavaScript nowadays, so sometimes you may need to open a modal or press a button to scrape data. This can become a nightmare with other tools; thankfully, there is a scrapy-splash plugin for handling JavaScript-heavy pages easily. Besides, Scrapy handles errors gracefully: it even has a built-in ability to resume scraping from the last page if it encounters an error. Although you get all of this for free, you need to allocate some time to learn Scrapy, as it is not as easy to use as other scraping tools. As a next step, if you intend to create your own standalone crawler, you may find <a href="https://github.com/adriancast/Scrapyd-Django-Template">adriancast’s</a> Scrapyd-Django-Template helpful. Check out how they implemented Scrapyd to deploy and run Scrapy spiders inside the Django app.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=20e1bc199bca" width="1" height="1" alt=""><hr><p><a href="https://codeburst.io/make-a-robust-crawler-with-scrapy-and-django-20e1bc199bca">Make a Robust Crawler with Scrapy and Django</a> was originally published in <a href="https://codeburst.io">codeburst</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[An Introduction to Web Scraping]]></title>
            <link>https://codeburst.io/an-introduction-to-web-scraping-820755f36e1d?source=rss-4b73bb1d006d------2</link>
            <guid isPermaLink="false">https://medium.com/p/820755f36e1d</guid>
            <category><![CDATA[web-crawling]]></category>
            <category><![CDATA[scrapy]]></category>
            <category><![CDATA[beautifulsoup]]></category>
            <category><![CDATA[web-scraping]]></category>
            <category><![CDATA[faq]]></category>
            <dc:creator><![CDATA[Bilge Demirkaya]]></dc:creator>
            <pubDate>Mon, 22 Feb 2021 16:51:54 GMT</pubDate>
            <atom:updated>2021-02-22T16:52:08.685Z</atom:updated>
            <content:encoded><![CDATA[<p><em>If the only way you access the Internet is through your browser, you are missing out on a huge range of possibilities.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-uOd4uP3LIABwGdTtolFkA.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/@nate_dumlao?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Nathan Dumlao</a> on <a href="https://unsplash.com/s/photos/spider-web?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></figcaption></figure><p>If you search “<em>Cheapest Flight to Istanbul</em>” on Google, you will get some popular flight search website results as well as some advertisements. Google will only report what these websites say in their content. A well-prepared scraper, on the other hand, can provide far richer information, including a detailed chart of <em>flights to Istanbul</em> with price changes over time, as well as a suggestion for the best time to buy the tickets, by gathering data across a variety of websites.</p><h3>So, what is Web Scraping exactly?</h3><p>In general terms, the main goal of web scraping is to extract structured data from unstructured web pages. <strong>Web scraping</strong>, <strong>web harvesting</strong>, or <strong>web data extraction</strong> is <a href="https://en.wikipedia.org/wiki/Data_scraping">data scraping</a> used for <a href="https://en.wikipedia.org/wiki/Data_extraction">extracting data</a> from websites.</p><p>There are different techniques and powerful tools you can use for web scraping. For example, if you manually <strong>copy-paste the data from a web page into a text editor</strong>, you are carrying out a basic form of web scraping. In fact, this technique might be the best method to employ when websites explicitly set up barriers to prevent machine automation.</p><h3>Should I use an API instead of scraping?</h3><p>APIs can be extremely useful if you find ones that satisfy your needs. However, it’s more likely that the API you’d like to use doesn’t exist or isn’t useful for your purposes.</p><ol><li>If you need to collect data from a variety of websites, you will probably struggle to find an API for each of them. Some websites might not want to share their data or might not provide an API at all.</li><li>Sometimes, even when you find a useful API, the provider may limit requests coming from the same IP, and may also restrict the data it shares through the API.</li></ol><p>As a result, when you need to gather data across websites and the available APIs are not useful for your purposes (for the reasons above), web scraping is going to be your best option.</p><h3>How do you know when and when not to scrape?</h3><p>I was astonished to learn that, with few exceptions, if you can view data in your browser then you can access it via a script. And if you can access data via a script, you can store it in your database and do anything with the data from there.</p><h3>But is it legal?</h3><p>This is a question that I really struggled to find an answer to. I even reached out to some big companies and asked ‘<strong>is it okay to scrape you?</strong>’ If you email them as I did, many companies will never tell you that you can scrape them. In fact, big companies use scraping themselves but don’t want others to use bots against them. In case you hadn’t realized it yet: <strong>Google Search</strong> is both a <strong>web crawler</strong> and a <strong>web scraper</strong>.
Google’s crawler is known as <strong>Googlebot</strong>. <a href="https://youtu.be/BNHR6IQJGZs">Through crawling and scraping of data, Googlebot discovers</a> new and updated pages to add to the Google search index.</p><p>Recently, the music lyrics repository Genius accused Google of lifting lyrics and posting them on its search platform. However, the court dismissed the lawsuit.</p><p><a href="https://techcrunch.com/2020/08/11/court-dismisses-genius-lawsuit-over-lyrics-scraping-by-google/">Court dismisses Genius lawsuit over lyrics-scraping by Google</a></p><p>While data gathering is <strong>legal</strong> in many countries, it <strong>varies across the world</strong> and has some <strong>ambiguity from country to country</strong>. With this in mind, here are some good rules to follow:</p><ul><li>Be careful not to cause damage or use data for bad purposes. There are lots of bad bots out there, and many of them eventually get sued.</li><li>It is most certainly <strong>illegal</strong> to <strong>analyze, change, or manipulate data, or to sell it to someone else.</strong></li></ul><p>I found this article useful for more details:</p><p><a href="https://medium.com/datadriveninvestor/web-scraping-for-data-science-is-it-legal-266fb51171ec">Web Scraping for Data Science — Is it legal?</a></p><p>Most say that as long as you’re scraping public information, and not causing any damage to anybody while doing so, your actions are <strong>legal</strong>.</p><p><strong>Fun fact:</strong> Amazon is the most scraped website in the world.</p><h3>What about websites’ security checks?</h3><p>I advise you to respect their security checks and scraping-protection mechanisms. You can check them at <strong><em>&lt;domainname&gt;/robots.txt</em></strong>. You can get banned after frequent scraping attempts. However, know that most of these protections are surmountable. A sample solution to an IP ban:</p><p><a href="https://stackoverflow.com/questions/35133200/scraping-in-python-preventing-ip-ban/35133929#35133929">Scraping in Python - Preventing IP ban</a></p><h3><strong>What are the best languages and tools for web scraping?</strong></h3><p><strong>Python</strong> is the most popular language for web scraping. It can handle almost all of the data extraction process and most web scraping requirements.</p><p><strong>Scrapy, Selenium and Beautiful Soup</strong> are the most widely used web scraping frameworks written in Python.</p><p><strong>Scrapy</strong> builds a robust system for extracting data even from complicated websites with lots of security checks, and it is reported to be many times faster than other tools. If your project is a big one and needs proxies or a data pipeline, I recommend Scrapy, even though it might take a little time to master.</p><p>Check this link out for more detail:</p><p><a href="https://medium.com/analytics-vidhya/scrapy-vs-selenium-vs-beautiful-soup-for-web-scraping-24008b6c87b8">Scrapy Vs Selenium Vs Beautiful Soup for Web Scraping.</a></p><h3>Is web scraping different from web crawling?</h3><p>Yes. When you only need data from a specific URL, use a <strong>web scraper</strong>. When you need to collect the URLs first and then get the data off them, use both a <strong>web crawler</strong> and a <strong>web scraper</strong>: the web crawler finds the URLs to fetch, and the web scraper takes the data out of those pages.</p><p>Let’s say you want to search a whole website for a t-shirt keyword and then get all of the t-shirt titles and prices.</p><p><strong>Step 1</strong>: Crawl the search URL, fetching all of the URLs that match the t-shirt keyword.</p><p><strong>Step 2</strong>: Scrape each URL from the list in step 1, and return the title and price of the t-shirts.</p>
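<p>In Scrapy terms, those two steps could look roughly like the sketch below. The search URL and the CSS selectors are invented for illustration; they do not belong to any real site:</p><pre>import scrapy<br><br>class TshirtSpider(scrapy.Spider):<br>    name = 'tshirts'<br>    # Step 1 starts from the search page for the keyword (placeholder URL)<br>    start_urls = ['http://www.example.com/search?q=t-shirt']<br><br>    def parse(self, response):<br>        # crawl: collect every product URL on the search results page<br>        for href in response.css('a.product-link::attr(href)').getall():<br>            yield response.follow(href, callback=self.parse_product)<br><br>    def parse_product(self, response):<br>        # Step 2: scrape the title and price from each product page<br>        yield {<br>            'title': response.css('h1::text').get(),<br>            'price': response.css('.price::text').get(),<br>        }</pre>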
<h3>Conclusion</h3><p>These are the questions that I struggled with when I was new to web scraping. I hope this article helps you understand some of the basics of web scraping. Thanks for reading!</p><p><strong>Please note:</strong> I am not an expert when it comes to the legality around the subject of web scraping and suggest that you do your own research regarding what is and isn’t legal when learning more about the subject. This article is intended to be a helpful introduction to the topic of web scraping and should not be used as legal advice for any of the topics covered here. Neither I nor Codeburst are responsible for any illegal action taken by readers in relation to web scraping or associated activities.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=820755f36e1d" width="1" height="1" alt=""><hr><p><a href="https://codeburst.io/an-introduction-to-web-scraping-820755f36e1d">An Introduction to Web Scraping</a> was originally published in <a href="https://codeburst.io">codeburst</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How To Create A Popup Modal With CSS + JavaScript]]></title>
            <link>https://javascript.plainenglish.io/how-to-create-a-popup-modal-with-css-javascript-7e5d369ae6f?source=rss-4b73bb1d006d------2</link>
            <guid isPermaLink="false">https://medium.com/p/7e5d369ae6f</guid>
            <category><![CDATA[javascript]]></category>
            <category><![CDATA[web-design]]></category>
            <category><![CDATA[coding]]></category>
            <category><![CDATA[css]]></category>
            <category><![CDATA[web-development]]></category>
            <dc:creator><![CDATA[Bilge Demirkaya]]></dc:creator>
            <pubDate>Sun, 07 Feb 2021 17:53:39 GMT</pubDate>
            <atom:updated>2021-02-07T18:45:42.131Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5lLGYJXaxAKqdevef94-QQ.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/@andrewtneel?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Andrew Neel</a> on <a href="https://unsplash.com/s/photos/europe-history?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></figcaption></figure><p>A modal is a popup window that is displayed in front of the current page when a button is pressed.</p><p>Here’s my example; I recently created a Twitter-like website using Django.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0uRh9suTo-Mk1PlzC1IRQA.jpeg" /></figure><h4>First step: style your parent section</h4><p>The first thing you want to do is create a semi-transparent black background. To do that, you need a parent element that wraps your content section.</p><pre>&lt;div class="bg-modal"&gt;<br>  &lt;div class="modal-content"&gt;<br><br>  &lt;/div&gt;<br>&lt;/div&gt;</pre><p>Let’s style this modal now:</p><pre>.bg-modal {<br>  width: 100%;<br>  height: 100%;<br>  background-color: black;<br>}</pre><p>You also need to tell the background modal to lie on top of the content.</p><pre>position: absolute;</pre><p>When you use absolute positioning, you also need to say where the element goes.</p><pre>top: 0px;</pre><p>The problem here is that if we use <em>opacity: 0.7</em> to make the background more transparent, that opacity is inherited by the child element, whether or not you specify the child element’s own opacity. The solution is not to set opacity here, but to put the transparency into the background color.</p><pre>background-color: rgba(0, 0, 0, 0.7); /* the 4th number is the opacity */</pre><p><strong>Let’s look at what we have so far:</strong></p><pre>.bg-modal {<br>  width: 100%;<br>  height: 100%;<br>  background-color: rgba(0, 0, 0, 0.5); /* make it half transparent */<br>  position: absolute;<br>  top: 0px;<br>  z-index: 1;<br>  display: none; /* it stays invisible until you open it */<br>  justify-content: center; /* center horizontally */<br>  align-items: center; /* center vertically */<br>}</pre><h4>Second Step: style your content section</h4><p>Here is mine.</p><pre>.modal-content {<br>  width: 600px;<br>  height: 300px;<br>  background-color: white;<br>  border: none;<br>  border-radius: 15px;<br>  padding: 15px;<br>  position: relative;<br>}</pre><p>The parent element’s position is <em>absolute</em>, which means you choose its exact location on the page. The content section gets <em>position: relative</em> so that anything you position with exact pixel values inside it (like the close button below) is placed relative to the modal box rather than to the whole page.</p><h4>Third Step: style the close button</h4><p>Now you need a button with which you can close the modal.</p><pre>&lt;div class="close"&gt; + &lt;/div&gt;</pre><ul><li>I prefer to use + and rotate it in the CSS, which gives it a better look than using just X. However, it is up to you.</li></ul><p>This is how I style it:</p><pre>.close {<br>  position: absolute;<br>  top: 5px;<br>  right: 10px;<br>  font-size: 25px;<br>  transform: rotate(45deg); /* makes the + look like an x */<br>  cursor: pointer;<br>}</pre><p>Let’s add some JavaScript to make this button actually work. Right now our modal is hidden.
All we want to do is click a button and open the popup:</p><pre>const openButton = document.getElementById('myBtn')<br>const modal = document.querySelector('.bg-modal')</pre><pre>openButton.addEventListener('click', () =&gt; {<br>  modal.style.display = 'flex'<br>})</pre><p>And click a button to hide the popup:</p><pre>const closeBtn = document.querySelector('.close')<br><br>closeBtn.addEventListener('click', () =&gt; {<br>  modal.style.display = 'none'  // reuses the modal variable declared above<br>})</pre><h4>Conclusion</h4><p>It’s simple JS and CSS that you can adapt and restyle to fit your website. You can fill the content section however you wish. Hope this helps!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7e5d369ae6f" width="1" height="1" alt=""><hr><p><a href="https://javascript.plainenglish.io/how-to-create-a-popup-modal-with-css-javascript-7e5d369ae6f">How To Create A Popup Modal With CSS + JavaScript</a> was originally published in <a href="https://javascript.plainenglish.io">JavaScript in Plain English</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How To Find The Most Frequent Element In An Array In JavaScript]]></title>
            <link>https://javascript.plainenglish.io/how-to-find-the-most-frequent-element-in-an-array-in-javascript-c85119dc78d2?source=rss-4b73bb1d006d------2</link>
            <guid isPermaLink="false">https://medium.com/p/c85119dc78d2</guid>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[algorithms]]></category>
            <category><![CDATA[javascript]]></category>
            <category><![CDATA[web-development]]></category>
            <category><![CDATA[coding]]></category>
            <dc:creator><![CDATA[Bilge Demirkaya]]></dc:creator>
            <pubDate>Fri, 05 Feb 2021 16:28:32 GMT</pubDate>
            <atom:updated>2021-02-18T18:59:35.596Z</atom:updated>
            <content:encoded><![CDATA[<h3>Easiest Way to Find the Most Frequent Element in an Array</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VStEGWJRRFZXIBXpe8ZmwQ.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/@casparrubin?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Caspar Camille Rubin</a> on <a href="https://unsplash.com/s/photos/javascript?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></figcaption></figure><p>While browsing, I couldn’t find a short, effective solution for finding the most frequent element in an array; I only found a lot of long solutions built on for loops. The best way is <strong>not to use nested for loops</strong>, because in Big O notation that is O(n²) complexity, which is not efficient.</p><p>The best way to find the most frequent element in an array is to use the reduce function to build a hashmap. And that’s the whole code you need:</p><pre>function getMostFrequent(arr) {<br>  const hashmap = arr.reduce((acc, val) =&gt; {<br>    acc[val] = (acc[val] || 0) + 1<br>    return acc<br>  }, {})<br>  return Object.keys(hashmap).reduce((a, b) =&gt; hashmap[a] &gt; hashmap[b] ? a : b)<br>}</pre><p>What I did was create a hashmap using reduce. If we have an array like ['john', 'doe', 'john', 'bilge'], our hashmap will look like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/279/1*cWmFdzad5JVm02XpMj4fXg.jpeg" /></figure><p>Because we create an object with acc in the first reduce function, notice that <strong>the initial value of acc is {}</strong>.</p><p>Then, for each value of the array, we check: is this value already in acc?</p><p>If not, we put a key-value pair in the object (the first appearance of the element).</p><p>If it is, we increment its value.</p><p>Once we have the hashmap with the elements and the number of times they occur in the array, we just need to get the key with the biggest value.</p><p>For that, we simply find the biggest value in the hashmap and return its key with the reduce function, like this:</p><pre>Object.keys(hashmap).reduce((a, b) =&gt; hashmap[a] &gt; hashmap[b] ? a : b)</pre><p><strong>Note:</strong> As you probably noticed, this will return <strong>only one key with the highest value</strong>. If you have two elements tied for the highest count and you want to return all of them as an array, you need to change the second reduce function.</p><p><strong>If there is more than one maximum in the array and you want to return an array of max values, you can use:</strong></p><pre>return Object.keys(hashmap).filter(x =&gt; {<br>  return hashmap[x] == Math.max.apply(null, Object.values(hashmap))<br>})</pre><p>Notice that this returns an array of the most frequent elements even if there is only one. It filters out the elements that don’t have the max value and returns the rest.</p><h4>Conclusion</h4><p>And there we have it! How to find the most common element in an array.
I hope you have found this useful!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c85119dc78d2" width="1" height="1" alt=""><hr><p><a href="https://javascript.plainenglish.io/how-to-find-the-most-frequent-element-in-an-array-in-javascript-c85119dc78d2">How To Find The Most Frequent Element In An Array In JavaScript</a> was originally published in <a href="https://javascript.plainenglish.io">JavaScript in Plain English</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>