# LatestValidProxies

**Repository Path**: CodexploRe/LatestValidProxies

## Basic Information

- **Project Name**: LatestValidProxies
- **Description**: LatestValidProxies is one of the modules in the ChenUtils package. It supports my web-crawler learning by fetching currently valid anonymous proxy IPs.
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 2
- **Forks**: 0
- **Created**: 2023-08-27
- **Last Updated**: 2025-03-29

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# LatestValidProxies

## Introduction

This is one of the packages in my toolkit (ChenUtils). Its purpose is to support my crawler learning by obtaining currently valid anonymous proxy IPs. The code was written by a Python beginner and is for reference only. For more information about the ChenUtils package, please visit [my project page](https://pypi.org/project/ChenUtils/).

## Import the package

Make sure you have a Python runtime environment and the pip tool, then run the following command in a cmd window:

```cmd
pip install ChenUtils
```

Then, in your Python code, use a `from ChenUtils.LatestValidProxies import ...` statement to import the functionality you need.

## Instructions for use

##### 1. Get a valid anonymous proxy IP

Spiders.py provides five crawlers that I wrote, each targeting a specific proxy-listing site:

```python
BeesProxySpider  # www.beesproxy.com
Ip89Spider       # www.89ip.cn
KuaidailiSpider  # www.kuaidaili.com
Ip66Spider       # www.66ip.cn
IhuanSpider      # ip.ihuan.me
```

Instantiate a spider and call its get_one_useful_proxy method; it returns a Proxy object whose properties expose the ip, port, speed, and other information. BeesProxySpider is recommended: its IPs are of relatively high quality and it fetches them faster. The reference code is as follows:

```python
from ChenUtils.LatestValidProxies.Spiders import BeesProxySpider
# from LatestValidProxies.Proxy import Proxy

beeproxy_spider = BeesProxySpider()
proxy = beeproxy_spider.get_one_useful_proxy()
# print(proxy.__dict__)
print(proxy.ip, proxy.port)
```

##### 2. Get a list of multiple valid anonymous proxy IPs

Call the get_useful_proxies method on the object; it returns a list of Proxy objects. The count parameter defaults to sys.maxsize, so when count is not set, the spider crawls the default number of pages, validates every proxy IP it finds, and assembles the valid ones into a list. The reference code is as follows:

```python
from ChenUtils.LatestValidProxies.Spiders import BeesProxySpider

beesproxy_spider = BeesProxySpider()
for proxy in beesproxy_spider.get_useful_proxies():
    print(proxy.ip, proxy.port)
```

Without the count parameter, crawling and validating the number of pages given by the self.max_pages attribute takes a long time, so I do not recommend this usage. Instead, pass the number of valid proxy IPs you need as the count parameter; this greatly reduces the waiting time and improves your efficiency. The reference code is as follows:

```python
for proxy in beesproxy_spider.get_useful_proxies(5):
    print(proxy.ip, proxy.port)
```
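A Proxy object is typically plugged straight into your own crawler. The following is a minimal sketch of that step, not part of the package itself: it assumes the returned address is a plain HTTP proxy and uses httpbin.org purely as an illustrative test target.

```python
import requests

from ChenUtils.LatestValidProxies.Spiders import BeesProxySpider

spider = BeesProxySpider()
proxy = spider.get_one_useful_proxy()

# Build a requests-style proxies mapping from the Proxy object's
# ip and port attributes (assuming a plain HTTP proxy).
proxies = {
    'http': f'http://{proxy.ip}:{proxy.port}',
    'https': f'http://{proxy.ip}:{proxy.port}',
}

# httpbin.org echoes the origin IP, which should now be the proxy's.
response = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=10)
print(response.json())
```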
##### 3. Get different proxy results within a short time

Although the proxy sites refresh their data every so often, a proxy IP may get blocked during a crawling task, and calling get_one_useful_proxy again shortly afterwards is likely to return the same proxy IP (because this method and the get_useful_proxies method crawl proxy IPs in page order). For this reason, ChenUtils-0.0.7 and later versions add an is_random parameter to both methods. It defaults to False; when set to True, the URL list is shuffled randomly, so the changed page order yields a proxy IP different from the previous call's. The reference code is as follows:

```python
from ChenUtils.LatestValidProxies.Spiders import BeesProxySpider

beesproxy_spider = BeesProxySpider()
proxy = beesproxy_spider.get_one_useful_proxy(is_random=True)
```

##### 4. Use more parameters to instantiate crawler objects

Each BaseSpider subclass exposes several parameters for users with more specific needs. The parent class's __init__ is defined as follows:

```python
class BaseSpider(object):

    def __init__(self, urls, group_xpath, detail_xpath, show_logs, max_pages, encoding,
                 max_worker, datetime_format, highest_latency):
        self.urls = urls
        self.group_xpath = group_xpath
        self.detail_xpath = detail_xpath
        self._to_end = False
        self.show_logs = show_logs
        self.max_pages = max_pages
        self.encoding = encoding
        self.max_worker = max_worker
        self.datetime_format = datetime_format
        self.test_timeout = highest_latency
```

You can pass the following parameters to a subclass to meet specific needs (a usage sketch follows this list):

* **highest_latency**: the maximum latency allowed for an obtained proxy IP
* **max_pages**: the maximum number of pages crawled by default
* **max_worker**: the maximum number of threads used by the crawler
* **encoding**: the decoding mode; when the proxy addresses fetched by a specific crawler show up as garbled Chinese characters, change the decoding method here
* **datetime_format**: the format used for a proxy IP's timestamps
* **show_logs**: defaults to False, i.e. the crawler's operation log is not displayed.
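As a quick illustration of these keyword arguments, here is a sketch with arbitrary example values; it assumes the concrete spiders such as BeesProxySpider accept the same keywords as BaseSpider, matching the subclass pattern shown in section 5 below.

```python
from ChenUtils.LatestValidProxies.Spiders import BeesProxySpider

# Assumed example values: keep only proxies that answer within 3 s,
# crawl at most 5 listing pages, and verify with 10 worker threads.
spider = BeesProxySpider(highest_latency=3, max_pages=5, max_worker=10)

proxy = spider.get_one_useful_proxy()
print(proxy.ip, proxy.port)
```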
If you need to follow the progress of a crawl, pass show_logs=True, as shown in the following reference code:

```python
beesproxy_spider = BeesProxySpider(show_logs=True)
proxy = beesproxy_spider.get_one_useful_proxy()
print(proxy.ip, proxy.port)
```

The code above makes the log object print something similar to the following in the terminal:

```
2023-08-30 22:00:56 Spider.py [line:25] INFO: Trying to establish a connection with https://www.beesproxy.com/free/page/1
2023-08-30 22:00:58 Spider.py [line:36] INFO: Collecting web proxy IP information
2023-08-30 22:00:58 Spider.py [line:78] INFO: Detecting proxy ip: 60.205.132.71:80
2023-08-30 22:00:58 Spider.py [line:78] INFO: Detecting proxy ip: 183.230.162.122:9091
2023-08-30 22:00:58 Spider.py [line:78] INFO: Detecting proxy ip: 111.43.105.50:9091
2023-08-30 22:00:58 Spider.py [line:78] INFO: Detecting proxy ip: 61.133.66.69:9002
2023-08-30 22:00:58 Spider.py [line:78] INFO: Detecting proxy ip: 112.250.110.172:9091
2023-08-30 22:00:58 Spider.py [line:78] INFO: Detecting proxy ip: 111.20.217.178:9091
2023-08-30 22:01:02 Spider.py [line:78] INFO: Detecting proxy ip: 120.196.188.21:9091
2023-08-30 22:01:02 Spider.py [line:98] INFO: Found a valid anonymous proxy ip: 61.133.66.69:9002
2023-08-30 22:01:02 Spider.py [line:102] INFO: This crawling proxy IP takes: 5.9 s, waiting for the end of other threads, estimated time consuming 0~5 s
61.133.66.69 9002
```

##### 5. Add crawlers for custom sites

First, build a concrete crawler for the target website. This breaks down into the following steps:

1. Inherit from the BaseSpider class
2. Construct the URL list based on the structure of the website's links
3. Work out the group xpath and the within-group xpath of the proxy table on that website
4. Assemble the concrete crawler class following the format of the reference code

The reference code is as follows:

```python
from ChenUtils.LatestValidProxies.BaseSpider import BaseSpider


class XXSpider(BaseSpider):

    def __init__(self, show_logs=False, max_pages=10, encoding=None, max_worker=5,
                 datetime_format='%Y-%m-%d %H:%M:%S', highest_latency=5):
        urls = [f'https://xxxxx/{page}' for page in range(1, max_pages + 1)]
        group_xpath = '//*[@id="xxxx"]/xxxx/table/tbody/tr'  # xpath for the rows of the proxy IP table
        detail_xpath = {
            'ip': './td[1]/text()',    # xpath for the ip cell text
            'port': './td[2]/text()',  # xpath for the port cell
            'area': './td[3]/text()'   # xpath for the area cell
        }
        super().__init__(urls=urls, group_xpath=group_xpath, detail_xpath=detail_xpath,
                         show_logs=show_logs, max_pages=max_pages, encoding=encoding,
                         max_worker=max_worker, datetime_format=datetime_format,
                         highest_latency=highest_latency)
```

After that, instantiate the object and call its crawling methods as described in the sections above. When a specific website has a quirk that needs special handling, you can override or add methods on the concrete crawler class to deal with it.
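To round this off, here is a short usage sketch for the hypothetical XXSpider above; the class and its xpaths are placeholders from the reference code, so this only runs once they point at a real site.

```python
# XXSpider is the placeholder class defined above; it behaves like
# any other BaseSpider subclass once its xpaths match a real site.
spider = XXSpider(show_logs=True, max_pages=3)

# Fetch a single validated proxy, shuffling page order as in section 3.
proxy = spider.get_one_useful_proxy(is_random=True)
print(proxy.ip, proxy.port)

# Or collect a handful of validated proxies, as in section 2.
for proxy in spider.get_useful_proxies(3):
    print(proxy.ip, proxy.port)
```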