Fathooo

News extraction process through scraping at Diario La Discusión

Scraping system to extract news, links, and content from the Diario La Discusión.

BeautifulSoup, Python, Requests, SQLite, Scrapy

- View on GitHub

Last Test of the Script - 03 / 03 / 2022

This script is a data extractor (scraper). In this case, I extracted content from https://www.ladiscusion.cl, a newspaper from my city. The scraping is divided into two parts: the first collects the news links, and the second extracts the content of each news item.

Before starting, we check robots.txt and verify the permissions granted by the newspaper.

2020220303132944.webp
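
The same check can also be done programmatically. Below is a minimal sketch using the standard library's urllib.robotparser. The robots.txt text is a hypothetical sample used to keep the example offline; it is not the newspaper's actual file, and the /private/ path is only an assumption for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (NOT the real file from the newspaper).
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

def build_parser(robots_text):
    """Build a RobotFileParser from robots.txt text already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

robots = build_parser(SAMPLE_ROBOTS)
base = "https://www.ladiscusion.cl"

# True when the rules allow any user agent ("*") to fetch the URL.
can_read_home = robots.can_fetch("*", base + "/")
can_read_private = robots.can_fetch("*", base + "/private/page")
print(can_read_home, can_read_private)
```

Against the live site, one would instead call set_url() with the robots.txt address followed by read(), and then use can_fetch() in the same way.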

Once ready, we continue.

  • The script creates two folders, data and data_content. The data folder stores the DataFrames of categories and their links, and the data_content folder stores the extracted content.

  • The first step to use the script is to select the first option:

2020220303133359.webp

  • Once the command [1] is entered, the following will appear:

2020220303133939.webp

  • In this second menu, entering 1 starts scraping all the links from the navigation bar.

2020220303132553.webp

  • Option 2, on the other hand, lets us scrape only the data we want. The following screenshots demonstrate option 2.

2020220303134235.webp

  • In the example above, I selected only items 0, 3, and 4. I then ran the script, and it started generating the tables.

2020220303134554.webp
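
A selection such as "0, 3, 4" has to be turned into valid indices before scraping. The following is only a sketch of how that could be handled; parse_selection and the category names are hypothetical and not taken from the original script.

```python
def parse_selection(raw, total):
    """Turn input such as '0, 3, 4' into sorted, unique, in-range indices."""
    chosen = set()
    for token in raw.replace(",", " ").split():
        if not token.isdigit():
            raise ValueError(f"not a number: {token!r}")
        index = int(token)
        if index >= total:
            raise ValueError(f"index out of range: {index}")
        chosen.add(index)
    return sorted(chosen)

# Hypothetical category names; the real ones come from the navigation bar.
categories = ["portada", "politica", "economia", "deportes", "cultura"]
selected = [categories[i] for i in parse_selection("0, 3, 4", len(categories))]
print(selected)
```

Rejecting invalid input before any request is sent avoids starting a long scraping run with a mistyped selection.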

Once the first process is completed:

  • We will have the data folder with the DataFrames.

2020220303152156.webp

  • Here is a glimpse of the first DataFrame.

2020220303140442.webp
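
To illustrate the first step, here is a minimal sketch of collecting links from a navigation bar and saving them as a table in the data folder. It makes several assumptions: it uses the standard library's html.parser instead of BeautifulSoup to stay dependency-free, it writes a CSV file as a stand-in for the DataFrame, and the sample HTML, the NavLinkParser class, the save_links function, and the nav_links.csv file name are all hypothetical.

```python
import csv
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin

BASE_URL = "https://www.ladiscusion.cl"

class NavLinkParser(HTMLParser):
    """Collect (text, absolute URL) pairs for <a> tags inside a <nav> element."""

    def __init__(self):
        super().__init__()
        self.nav_depth = 0
        self.current_href = None
        self.current_text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "nav":
            self.nav_depth += 1
        elif tag == "a" and self.nav_depth > 0:
            self.current_href = dict(attrs).get("href")
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "nav" and self.nav_depth > 0:
            self.nav_depth -= 1
        elif tag == "a" and self.current_href is not None:
            text = "".join(self.current_text).strip()
            self.links.append((text, urljoin(BASE_URL, self.current_href)))
            self.current_href = None

def save_links(links, folder):
    """Write the links to <folder>/nav_links.csv, creating the folder if needed."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / "nav_links.csv"
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["category", "link"])
        writer.writerows(links)
    return path

# Hypothetical markup, used only to keep the example offline.
SAMPLE_HTML = """
<nav>
  <a href="/category/deportes/">Deportes</a>
  <a href="/category/cultura/">Cultura</a>
</nav>
<a href="/about/">Outside the nav</a>
"""

nav_parser = NavLinkParser()
nav_parser.feed(SAMPLE_HTML)
print(nav_parser.links)
```

Calling save_links(nav_parser.links, "data") would write the table to disk, and the resulting file can be loaded into a DataFrame with pandas.read_csv.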

Second Step - Creation of DataFrames with Title - Time - Content - Subtitles

  • With the first step, we will have all the news links from the page that were available at the time of executing the script.
  • The next step is simple.
  • In the script menu, there is option [2]. 2020220303153014.webp
  • This option displays another menu, which will use all the links found in the files in the data folder.
  • Option [1] processes all of the files.
  • Option [2] scrapes only specific files.
  • I will use option [2] as a test.

2020220303153451.webp

  • We can observe that some links cannot be opened; however, the script continues.

2020220303153805.webp

  • Once finished, the data_content folder contains the DataFrames with the content, ready for manipulation.

2020220303154624.webp

  • We will look at the first one as an example of a DataFrame. 2020220303160128.webp

  • And the output of df.info() for the same DataFrame. 2020220303160158.webp
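
The behaviour where some links cannot be opened but the script continues can be sketched as a loop that catches the error and moves on. This is only an illustration under assumptions: scrape_all and fake_fetch are hypothetical names, and fake_fetch simulates the network so the example runs offline.

```python
def scrape_all(links, fetch):
    """Call fetch(link) for every link; skip the ones that fail and keep going."""
    rows = []
    for link in links:
        try:
            rows.append({"link": link, "content": fetch(link)})
        except Exception as error:  # broad on purpose: log any failure and continue
            print(f"Could not open {link}: {error}")
    return rows

def fake_fetch(link):
    """Stand-in for a real HTTP request; fails for links containing 'broken'."""
    if "broken" in link:
        raise ConnectionError("simulated network failure")
    return f"<html>content of {link}</html>"

rows = scrape_all(
    [
        "https://example.com/a",
        "https://example.com/broken",
        "https://example.com/b",
    ],
    fake_fetch,
)
print(len(rows))
```

In a real run, fetch could wrap requests.get(link, timeout=10) followed by raise_for_status(), so that HTTP errors are skipped in the same way as connection errors.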

Duration of the Scraping

  • First part: 40 minutes.
  • Second part: 10 hours.

Aspects to Optimize

  • Threads can be used to scrape several links in parallel and reduce the total scraping time.
  • Links whose GET request failed can be recorded and scraped again separately.
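
Both ideas can be combined in one sketch using concurrent.futures.ThreadPoolExecutor from the standard library. This is an assumption about how the optimization could look, not part of the current script; scrape_parallel and fake_fetch are hypothetical, and fake_fetch simulates the requests so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_parallel(links, fetch, max_workers=8):
    """Fetch links in parallel threads; return (results, failed_links)."""
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its link so failures can be recorded.
        futures = {pool.submit(fetch, link): link for link in links}
        for future in as_completed(futures):
            link = futures[future]
            try:
                results[link] = future.result()
            except Exception:
                failed.append(link)  # keep it for an independent retry later
    return results, failed

def fake_fetch(link):
    """Stand-in for a real GET request; fails for links ending in '/broken'."""
    if link.endswith("/broken"):
        raise ConnectionError("simulated failure")
    return link.upper()

links = [f"https://example.com/{name}" for name in ("a", "b", "broken")]
results, failed = scrape_parallel(links, fake_fetch, max_workers=3)
print(sorted(results), failed)
```

The failed list can then be passed to scrape_parallel again, or saved to a file, so those links are scraped independently. Threads fit this case because most of the time is spent waiting on the network; the number of workers should stay modest to avoid overloading the site.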