
Monitor XML sitemaps at scale

When monitoring very large websites, crawling or checking every URL at scale is often a waste of resources. Instead, you can crawl a smaller set of URLs that is representative of the whole site. This is useful for things like:

  1. Checking HTTP status codes,
  2. Monitoring performance,
  3. Index monitoring (Search Console API > URL Inspection tool; see the sketch after this list),
  4. Tracking traffic changes.
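For point 3, a minimal sketch of what that inspection step could look like, assuming the Google API Python client and a service account with access to your Search Console property. This is not part of the original script: the credentials file, SITE_URL, and the use of `samples` (the sampled URL list produced by the script further below) are all placeholders you would fill in yourself.

# Sketch (assumption, not from the original script): inspect sampled URLs
# via the Search Console URL Inspection API. "service-account.json" and
# SITE_URL are placeholders for your own credentials and verified property.
# The Inspection API is rate-limited, which is exactly why sampling helps:
# only inspect a sample, never the full URL set.
from google.oauth2 import service_account
from googleapiclient.discovery import build

SITE_URL = "https://applepy.online/"  # your verified Search Console property

creds = service_account.Credentials.from_service_account_file(
    "service-account.json",
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=creds)

def inspect_url(url):
    # One request per URL: returns Google's index coverage state for it
    body = {"inspectionUrl": url, "siteUrl": SITE_URL}
    result = service.urlInspection().index().inspect(body=body).execute()
    return result["inspectionResult"]["indexStatusResult"].get("coverageState")

# `samples` comes from the main script below
coverage = {url: inspect_url(url) for url in samples}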


This way you can estimate the effect of changes across all URLs, keep tooling costs down, and stay within API quotas.
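For a sense of scale: the script below sizes the sample with Cochran's formula plus a finite population correction. Here is a minimal worked sketch of that math, where the 100,000-URL population is an assumed example value (the confidence settings match the 99% / 3% defaults used in the script):

import math

# Cochran's sample size formula with finite population correction.
# Illustrative inputs; z = 2.57 approximates the 99% confidence level.
N = 100_000   # total URLs in the sitemaps (assumed example)
z = 2.57      # z-score for 99% confidence
p = 0.5       # worst-case proportion
e = 0.03      # 3% confidence interval

n_0 = (z**2 * p * (1 - p)) / e**2   # infinite-population sample size
n = n_0 / (1 + (n_0 - 1) / N)       # finite population correction
print(math.ceil(n))                 # -> 1802: sampling ~1.8% of the URLs

So even for a site with 100,000 URLs, roughly 1,800 sampled URLs are enough at these settings.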

Below is a free Python script that scrapes the XML sitemaps. Naturally, make sure the XML sitemaps themselves are in order before running it.

import gzip
import math
import random
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tqdm.auto import tqdm

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0"
CONFIDENCE_LEVEL_CONSTANTS = [[50, .67], [68, .99], [90, 1.64], [95, 1.96], [99, 2.57]]


# CALCULATE THE SAMPLE SIZE
# http://bc-forensics.com/?p=15
def sample_size(population_size, confidence_level, confidence_interval):
    Z = 0.0
    p = 0.5
    e = confidence_interval / 100.0
    N = population_size

    # LOOP THROUGH SUPPORTED CONFIDENCE LEVELS AND FIND THE NUM STD
    # DEVIATIONS FOR THAT CONFIDENCE LEVEL
    for i in CONFIDENCE_LEVEL_CONSTANTS:
        if i[0] == confidence_level:
            Z = i[1]
    if Z == 0.0:
        return -1

    # CALC SAMPLE SIZE
    n_0 = ((Z**2) * p * (1 - p)) / (e**2)

    # ADJUST SAMPLE SIZE FOR FINITE POPULATION
    n = n_0 / (1 + ((n_0 - 1) / float(N)))
    return int(math.ceil(n))  # THE SAMPLE SIZE


def parse_sitemap(s_url, pbar=None, level=1):
    result = []
    count = 0
    existing_pbar = isinstance(pbar, tqdm)

    # Fetch the sitemap with a browser-like User-Agent
    headers = {"User-Agent": USER_AGENT, "Accept-Encoding": "gzip"}
    response = requests.get(s_url, headers=headers)
    if response.status_code == 200:
        content = response.content
    else:
        print(f"ERROR: Sitemap returned invalid response: {response.status_code}")
        return result, count

    # If an explicit gzip content type is set, decompress
    if response.headers.get('Content-Type', '').lower() == 'application/x-gzip':
        content = gzip.decompress(content)

    try:
        # Convert to readable string
        xml = str(content, 'UTF-8')
    except UnicodeDecodeError:
        # Convert to readable string after trying to decompress first
        xml = str(gzip.decompress(content), 'UTF-8')
    except Exception as e:
        # Not sure what filetype this is, so exiting
        print('ERROR: Could not parse XML file. Error:', str(e))
        return result, count

    soup = BeautifulSoup(xml, features="xml")
    urls = soup.find_all("loc")
    if len(urls) > 0:
        text_urls = [url.get_text() for url in urls]

        # A <loc> ending in .xml/.gz points to a child sitemap (index file)
        test_indexes_first = len(re.findall(r"\.(xml|gz|xml\.gz)$", text_urls[0])) > 0
        test_indexes_last = len(re.findall(r"\.(xml|gz|xml\.gz)$", text_urls[-1])) > 0
        test_mixed_urls = test_indexes_first != test_indexes_last

        sitemap_urls = []
        html_urls = []
        if test_mixed_urls:
            sitemap_urls = [url for url in text_urls if len(re.findall(r"\.(xml|gz|xml\.gz)$", url)) > 0]
            html_urls = [url for url in text_urls if url not in sitemap_urls]
        elif test_indexes_first:
            sitemap_urls = text_urls
        else:
            html_urls = text_urls

        # Recurse into child sitemaps, keeping one progress bar up to date
        if len(sitemap_urls) > 0:
            if existing_pbar:
                prior_n = pbar.n
                prior_t = pbar.total
                pbar.reset(total=int(prior_t + len(sitemap_urls)))
                pbar.update(prior_n)
                pbar.refresh()
            else:
                pbar = tqdm(desc=f"Parsing {s_url}", total=len(sitemap_urls))
            for url in sitemap_urls:
                locs, c = parse_sitemap(url, pbar=None if level == 1 else pbar, level=level + 1)
                result.extend(locs)
                count = count + c
                pbar.update()

        if len(html_urls) > 0:
            result.extend(html_urls)
            count += len(html_urls)
    else:
        print('ERROR: No URLs Found', 'Status:', response.status_code)

    return result, count


def parse_sitemap_threaded(s_url):
    urls, url_count = parse_sitemap(s_url)
    return urls, url_count, s_url


def url_status(url):
    headers = {"User-Agent": USER_AGENT}
    response = requests.get(url, headers=headers)
    return int(response.status_code)


output_filename = "output.csv"
sitemap_url = 'https://applepy.online/sitemap.xml'
confidence_level = 99.0
confidence_interval = 3.0

urls, url_count = parse_sitemap(sitemap_url)
url_sample_sz = sample_size(url_count, confidence_level, confidence_interval)
samples = [urls[i] for i in random.sample(range(url_count), url_sample_sz)]

print('Number of Samples:', len(samples))
print('Number of Total URLs:', len(urls))
print('Percent of Total URLs:', round(len(samples) / len(urls), 2))

print('Creating DataFrame')
statuses = [url_status(u) for u in tqdm(samples)]
df = pd.DataFrame({'samples': samples, 'statuses': statuses})
df.to_csv(output_filename, index=None)
print(output_filename, 'saved.')
df.head()
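Once output.csv is written, a quick way to turn the raw statuses into something you can actually monitor is to aggregate them. A minimal follow-up sketch; the 2% threshold and the idea of alerting on the non-200 share are assumed examples, not part of the original script:

# Sketch: summarize the sampled statuses and flag a drop in healthy URLs.
# The 2% threshold is an assumed example value, not from the original post.
import pandas as pd

df = pd.read_csv("output.csv")
print(df['statuses'].value_counts())          # e.g. 200: 1750, 404: 38, ...

error_share = (df['statuses'] != 200).mean()  # fraction of non-200 samples
if error_share > 0.02:
    print(f"WARNING: {error_share:.1%} of sampled URLs are not returning 200")

Because the sample is representative, a rising non-200 share in the sample is a statistically grounded signal that the same is happening site-wide.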

Monitoring XML sitemaps at scale matters for search engine optimization: it helps ensure that all pages on a website get indexed by search engines. More information about XML sitemaps.


TIP! ApplePY includes many extra scripts for building topic clusters, plus more than 50 scripts for other use cases, and new ones are added every month. Try ApplePY for free.
