Console output while scraping is in progress

News scraping - La Discusión newspaper

fathooo
Technology, Scraping, Back end


- See on GitHub

Last script test - 03/03/2022


This script is a data extractor (scraper). In this case, I extracted content from https://www.ladiscusion.cl, a newspaper from my city. The scraping is divided into two parts:

Before starting, we checked robots.txt and verified the permissions granted by the newspaper.

2020220303132944.webp

Once ready, we continue.
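The robots.txt check can also be automated with Python's standard library. A minimal sketch; the rules below are an illustrative example, not the newspaper's actual file:

```python
from urllib.robotparser import RobotFileParser

# Parse robots.txt rules and ask whether a given path may be fetched.
# These example rules are assumptions for illustration only.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /wp-admin/",
])

print(robots.can_fetch("*", "https://www.ladiscusion.cl/"))           # True
print(robots.can_fetch("*", "https://www.ladiscusion.cl/wp-admin/"))  # False
```

In a real run you would point `RobotFileParser` at the live `https://www.ladiscusion.cl/robots.txt` URL and call `read()` instead of feeding it lines by hand.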


  • The script creates two folders, data and data_content. The categories with the links will be stored in a dataframe in the data folder, and the content will be stored in the data_content folder.

  • To start using the script, choose the first menu option:

2020220303133359.webp

  • Once the [1] command is entered, the following will appear:

2020220303133939.webp

  • If command 1 is entered, the script starts scraping every link from the navigation bar. 2020220303132553.webp

  • However, option 2 allows us to scrape only the data we want. For demonstration purposes, I will show pictures of option 2.

2020220303134235.webp

  • In the previous case, I entered only categories 0, 3, and 4. Then I ran the code and it started generating the tables.

2020220303134554.webp
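Extracting the navigation-bar links can be sketched with the standard library's HTML parser. The sample markup and category URLs below are assumptions for illustration; the real page would first be fetched with a GET request, and its markup may differ:

```python
from html.parser import HTMLParser

class NavLinkParser(HTMLParser):
    """Collect the href of every <a> tag found inside the <nav> element."""
    def __init__(self):
        super().__init__()
        self.in_nav = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "nav":
            self.in_nav = True
        elif tag == "a" and self.in_nav:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "nav":
            self.in_nav = False

# Illustrative markup standing in for the downloaded page.
sample = """
<nav>
  <a href="https://www.ladiscusion.cl/category/cronica/">Crónica</a>
  <a href="https://www.ladiscusion.cl/category/deportes/">Deportes</a>
</nav>
"""

parser = NavLinkParser()
parser.feed(sample)
print(parser.links)
```

Each collected link then becomes one category the menu lets you include or skip.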

Once the first process is finished:

  • The data folder will contain the generated dataframes.

2020220303152156.webp

  • Here's a glimpse of the first dataframe.

2020220303140442.webp
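Storing each category's links in its own file under the data folder can be sketched as follows. The column names, file layout, and helper name are my assumptions; the actual script stores dataframes:

```python
import csv
from pathlib import Path

def save_category(folder, category, rows):
    """Write one category's (title, link) rows to <folder>/<category>.csv."""
    Path(folder).mkdir(parents=True, exist_ok=True)  # create data/ if missing
    path = Path(folder) / f"{category}.csv"
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "link"])  # header row
        writer.writerows(rows)
    return path

# Hypothetical rows scraped from one category page.
path = save_category("data", "deportes", [
    ("Local team wins", "https://www.ladiscusion.cl/example-1/"),
])
print(path)
```

The same pattern, repeated per category, yields the set of files shown in the data folder above.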


Second step - Creating dataframes with Title, Time, Content, and Subtitles

  • After the first step, we already have all the news links that were available on the site at the time the script ran.
  • The next step is simple.
  • In the script menu, option [2] is available. 2020220303153014.webp
  • This option displays another menu that works with the links stored in the files in the data folder.
  • Option [1] is used to do it with all the files.
  • Option [2] is used to scrape specific files.
  • I will use option two as a test.

2020220303153451.webp

  • We can see that some links cannot be opened, but the script continues.

2020220303153805.webp
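Continuing past links that cannot be opened comes down to a try/except around each request. A stdlib-only sketch; the helper name is mine, and the per-article field extraction (Title, Time, Content, Subtitles) is omitted:

```python
from urllib.request import urlopen

def fetch(url, timeout=10):
    """Return the page HTML, or None if the request fails for any reason."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except Exception as exc:
        print(f"skipping {url}: {exc}")
        return None

# The second step iterates over every stored link; failed links are
# skipped instead of aborting the whole run.
links = ["not-a-valid-url"]  # illustrative list
pages = [(u, fetch(u)) for u in links]
failed = [u for u, html in pages if html is None]
print(failed)
```

Keeping the `failed` list around is what makes the retry idea in "Aspects to optimize" possible.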

  • Once finished, we will have the dataframes and content ready for manipulation in our data_content folder.

2020220303154624.webp

  • Let's look at the first one as an example of a dataframe. 2020220303160128.webp

  • The output of df.info() as well. 2020220303160158.webp

Scraping duration

  • First part: 40 minutes.
  • Second part: 10 hours.

Aspects to optimize

  • Threads can be used to reduce the total scraping time by performing requests in parallel.
  • The links whose GET request failed can be collected and retried in a separate, independent pass.
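Both optimizations can be sketched with `concurrent.futures` from the standard library. The worker below is a stand-in for the real per-link scraping function, and the URLs are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_link(url):
    """Stand-in worker: the real function would GET the url and parse it."""
    if "bad" in url:
        raise ValueError("could not open link")
    return f"content of {url}"

urls = ["https://example.com/a", "https://example.com/bad", "https://example.com/c"]
results, failed = {}, []

# Run up to 8 workers in parallel; collect failures for a later retry pass.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(scrape_link, u): u for u in urls}
    for fut, url in futures.items():
        try:
            results[url] = fut.result()
        except Exception:
            failed.append(url)

print(sorted(results))
print(failed)
```

Since the work is network-bound, threads should cut the 10-hour second part roughly in proportion to the worker count, and `failed` feeds the independent retry run.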