Console output while scraping is in progress

News scraping - La Discusión newspaper

fathooo
Technology, Scraping, Back end


- See on GitHub

Last script test - 03/03/2022


This script is a data extractor (scraper). In this case, I extracted content from https://www.ladiscusion.cl, a newspaper from my city. The scraping is divided into two parts:

Before starting, we checked robots.txt and verified the permissions granted by the newspaper.

2020220303132944.webp

Once ready, we continue.
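The robots.txt check can also be automated with Python's standard library. A minimal sketch; the rules below are an illustrative example, not the newspaper's actual file:

```python
from urllib.robotparser import RobotFileParser

# Parse robots.txt rules and ask whether a given path may be fetched.
# These example rules are assumptions for illustration only.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /wp-admin/",
])

print(robots.can_fetch("*", "https://www.ladiscusion.cl/"))           # True
print(robots.can_fetch("*", "https://www.ladiscusion.cl/wp-admin/"))  # False
```

In a real run you would point `RobotFileParser` at the live `https://www.ladiscusion.cl/robots.txt` URL and call `read()` instead of feeding it lines by hand.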


  • The script creates two folders, data and data_content. The categories with the links will be stored in a dataframe in the data folder, and the content will be stored in the data_content folder.

  • To start using the script, choose the first menu option:

2020220303133359.webp

  • Once the [1] command is entered, the following will appear:

2020220303133939.webp

  • If command 1 is entered, the script starts scraping every link from the navigation bar. 2020220303132553.webp

  • However, option 2 allows us to scrape only the data we want. For demonstration purposes, I will show pictures of option 2.

2020220303134235.webp

  • In the previous case, I entered only categories 0, 3, and 4. Then I ran the code and it started generating the tables.

2020220303134554.webp
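Extracting the navigation-bar links can be sketched with the standard library's HTML parser. The sample markup and category URLs below are assumptions for illustration; the real page would first be fetched with a GET request, and its markup may differ:

```python
from html.parser import HTMLParser

class NavLinkParser(HTMLParser):
    """Collect the href of every <a> tag found inside the <nav> element."""
    def __init__(self):
        super().__init__()
        self.in_nav = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "nav":
            self.in_nav = True
        elif tag == "a" and self.in_nav:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "nav":
            self.in_nav = False

# Illustrative markup standing in for the downloaded page.
sample = """
<nav>
  <a href="https://www.ladiscusion.cl/category/cronica/">Crónica</a>
  <a href="https://www.ladiscusion.cl/category/deportes/">Deportes</a>
</nav>
"""

parser = NavLinkParser()
parser.feed(sample)
print(parser.links)
```

Each collected link then becomes one category the menu lets you include or skip.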

Once the first process is finished:

  • The data folder will contain the generated dataframes.

2020220303152156.webp

  • Here's a glimpse of the first dataframe.

2020220303140442.webp
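Storing each category's links in its own file under the data folder can be sketched as follows. The column names, file layout, and helper name are my assumptions; the actual script stores dataframes:

```python
import csv
from pathlib import Path

def save_category(folder, category, rows):
    """Write one category's (title, link) rows to <folder>/<category>.csv."""
    Path(folder).mkdir(parents=True, exist_ok=True)  # create data/ if missing
    path = Path(folder) / f"{category}.csv"
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "link"])  # header row
        writer.writerows(rows)
    return path

# Hypothetical rows scraped from one category page.
path = save_category("data", "deportes", [
    ("Local team wins", "https://www.ladiscusion.cl/example-1/"),
])
print(path)
```

The same pattern, repeated per category, yields the set of files shown in the data folder above.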


Second step - Creating dataframes with Title, Time, Content, and Subtitles

  • After the first step, we already have all the news links that were available on the site at the time the script ran.
  • The next step is simple.
  • In the script menu, option [2] is available. 2020220303153014.webp
  • This option displays another menu that works with the links stored in the files in the data folder.
  • Option [1] is used to do it with all the files.
  • Option [2] is used to scrape specific files.
  • I will use option two as a test.

2020220303153451.webp

  • We can see that some links cannot be opened, but the script continues.

2020220303153805.webp
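Continuing past links that cannot be opened comes down to a try/except around each request. A stdlib-only sketch; the helper name is mine, and the per-article field extraction (Title, Time, Content, Subtitles) is omitted:

```python
from urllib.request import urlopen

def fetch(url, timeout=10):
    """Return the page HTML, or None if the request fails for any reason."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except Exception as exc:
        print(f"skipping {url}: {exc}")
        return None

# The second step iterates over every stored link; failed links are
# skipped instead of aborting the whole run.
links = ["not-a-valid-url"]  # illustrative list
pages = [(u, fetch(u)) for u in links]
failed = [u for u, html in pages if html is None]
print(failed)
```

Keeping the `failed` list around is what makes the retry idea in "Aspects to optimize" possible.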

  • Once finished, we will have the dataframes and content ready for manipulation in our data_content folder.

2020220303154624.webp

  • Let's look at the first one as an example of a dataframe. 2020220303160128.webp

  • The output of df.info() as well. 2020220303160158.webp

Scraping duration

  • First part: 40 minutes.
  • Second part: 10 hours.

Aspects to optimize

  • Threads can be used to reduce the total scraping time by performing requests in parallel.
  • The links whose GET request failed can be collected and retried in a separate, independent pass.
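Both optimizations can be sketched with `concurrent.futures` from the standard library. The worker below is a stand-in for the real per-link scraping function, and the URLs are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_link(url):
    """Stand-in worker: the real function would GET the url and parse it."""
    if "bad" in url:
        raise ValueError("could not open link")
    return f"content of {url}"

urls = ["https://example.com/a", "https://example.com/bad", "https://example.com/c"]
results, failed = {}, []

# Run up to 8 workers in parallel; collect failures for a later retry pass.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(scrape_link, u): u for u in urls}
    for fut, url in futures.items():
        try:
            results[url] = fut.result()
        except Exception:
            failed.append(url)

print(sorted(results))
print(failed)
```

Since the work is network-bound, threads should cut the 10-hour second part roughly in proportion to the worker count, and `failed` feeds the independent retry run.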