Crawling the Decision Documents, First Pass: Document Body


notty 2023. 10. 30. 13:33


import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

# Number of list pages to walk through (20 cases per page)
pg_num = 66
data = []
doc_data = []
# Section headings expected in each decision document
sections = ['주문', '청구 취지', '신청 내용', '신청인 주장', '진료기록 및 의학적 소견', '인정 사실', '관계 법령', '위원회 판단 및 결론']
# One list per section; each crawled document appends one string to each list
parsed_data = {section: [] for section in sections}

for page in range(1, pg_num + 1):
    # Step 1: open the paginated case list for this page in the browser
    url = f'https://jilbyungcase.comwel.or.kr/service/dataList?qw=&q=&gubun=%EC%A7%81%EC%97%85%EC%84%B1%EC%95%94+%EB%93%B1+%EC%95%85%EC%84%B1%EC%8B%A0%EC%83%9D%EB%AC%BC&gubun2=&viewType=&sortField=sort5&sortOrder=desc&pageIndex={page}&pageUnit=20'
    driver.get(url)
    time.sleep(2)  # crude wait for the list to render
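    for btn_i in range(20):
        # Step 2: each 'btn-badge' button opens one decision document in a new window
        buttons = driver.find_elements(By.CLASS_NAME, 'btn-badge')
        if btn_i >= len(buttons):
            break  # a page (e.g. the last one) may list fewer than 20 cases
        buttons[btn_i].click()
        time.sleep(5)
        driver.switch_to.window(driver.window_handles[-1])
        time.sleep(3)  # let the document finish rendering before reading it
        soup = BeautifulSoup(driver.page_source, 'html.parser')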

        # Step 3: group the h4 text under each h1 heading, skipping the first h1.
        # An earlier draft used h1_tags[idx + 1] as the boundary, but that is the
        # current heading itself, so the boundary index below is idx + 2.
        h1_tags = soup.find_all('h1')
        for idx, h1_tag in enumerate(h1_tags[1:]):
            section_title = h1_tag.text.strip()
            temp = []

            # Collect every h4 tag between this h1 and the next h1
            next_h1_tag = h1_tags[idx + 2] if idx + 2 < len(h1_tags) else None
            current = h1_tag.find_next_sibling()

            while current is not None and current is not next_h1_tag:
                if current.name == 'h4':
                    temp.append(current.text.strip())
                current = current.find_next_sibling()

            concatenated_text = " ".join(temp)
            # Append the joined text; setdefault avoids a KeyError when a
            # heading is not in the predefined sections list
            parsed_data.setdefault(section_title, []).append(concatenated_text)

        driver.close()
        driver.switch_to.window(driver.window_handles[0])
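
The h1/h4 grouping in the loop above is hard to check while a browser is driving it. Below is a minimal, dependency-free sketch of the same idea that runs on a plain HTML string. It swaps BeautifulSoup for the standard library's `html.parser`, so it is not the code used in the crawl. The names `SectionParser` and `parse_sections`, the `skip_first` flag, and the sample HTML are all made up for illustration, and the real pages may be structured differently.

```python
from html.parser import HTMLParser


class SectionParser(HTMLParser):
    """Hypothetical helper: collect h4 text under the most recent h1."""

    def __init__(self):
        super().__init__()
        self.sections = {}   # section title -> list of h4 strings
        self._title = None   # title of the h1 currently in scope
        self._tag = None     # 'h1' or 'h4' while inside such a tag
        self._buf = []       # text fragments of the open h1/h4

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h4"):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        # Only keep text that sits inside an open h1 or h4
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # closing tag of something nested, e.g. <b> inside <h4>
        text = "".join(self._buf).strip()
        if tag == "h1":
            self._title = text
            self.sections.setdefault(text, [])
        elif self._title is not None:
            self.sections[self._title].append(text)
        self._tag = None


def parse_sections(html, skip_first=True):
    """Return {h1 title: joined h4 text}; skip_first mirrors h1_tags[1:]."""
    parser = SectionParser()
    parser.feed(html)
    parser.close()
    items = list(parser.sections.items())
    if skip_first:
        items = items[1:]
    return {title: " ".join(parts) for title, parts in items}


# Made-up sample: the first h1 plays the role of the page title
sample = """
<h1>page title</h1>
<h1>주문</h1>
<h4>line one</h4>
<h4> line two </h4>
<h1>청구 취지</h1>
<h4>line three</h4>
"""
result = parse_sections(sample)
print(result)
```

Unlike the sibling walk in the crawl code, this version follows document order, so it also picks up h4 tags that are not direct siblings of the h1. Whether that matters depends on the actual page markup.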

Tables are not scraped yet.
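
On a separate note, the long percent-encoded list URL in the page loop can be rebuilt from readable parameters with `urllib.parse.urlencode`. This is only a sketch: `BASE_URL` and `build_list_url` are hypothetical names, and the parameter values are simply decoded from the URL used in the crawl code.

```python
from urllib.parse import urlencode

# Hypothetical constant: the list endpoint taken from the crawl URL
BASE_URL = "https://jilbyungcase.comwel.or.kr/service/dataList"


def build_list_url(page, page_unit=20):
    """Hypothetical helper: build the case-list URL for one page."""
    params = {
        "qw": "",
        "q": "",
        # Decoded form of the gubun filter in the original URL
        "gubun": "직업성암 등 악성신생물",
        "gubun2": "",
        "viewType": "",
        "sortField": "sort5",
        "sortOrder": "desc",
        "pageIndex": page,
        "pageUnit": page_unit,
    }
    # urlencode percent-encodes the Korean text and turns spaces into '+'
    return f"{BASE_URL}?{urlencode(params)}"


print(build_list_url(1))
```

Keeping the filter as readable text makes it easier to switch to another category later without re-encoding the string by hand.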
