Extracting data from html table using python and beautiful soup


<table class="softwares" border="1" cellpadding="0" width="99%">
    <thead style="background-color: #ededed">
        <tr>
            <td colspan="5"><b>Windows</b></td>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td><b>Type</b></td>
            <td><b>Issue</b></td>
            <td><b>Restart</b></td>
            <td><b>Severity</b></td>  
            <td><b>Impact</b></td>  
        </tr>
        <tr>
            <td>some item</td>
            <td><a href="some website">some website</a><br></td>
            <td>Yes<br></td>
            <td>Critical<br></td>
            <td>stuff<br></td>
        </tr>    
        <tr>
            <td>some item</td>
            <td><a href="some website">some website</a><br></td>
            <td>Yes<br></td>
            <td>Important<br></td>
            <td>stuff<br></td>    
        </tr>
    </tbody>
</table>

The html page that I am trying to get the data from is a local file that I have saved onto my pc and is filled with multiple tables formatted the same as this. I'm trying to get the both the title for each of these tables, in this specific case "Windows," as well as the urls that are located in the table body. I have been trying to use beautiful soup and python to get the table titles and the websites and print them in a table with the title on the left, and the corresponding urls on the right, but I am unable to do so. Any help would be greatly appreciated.


Answer

You can use find_all to gather all td tag objects, and then apply additional logic to store the hrefs:

from bs4 import BeautifulSoup as soup
import re
s = soup(re.sub('\s', '', open('filename.html').read()), 'lxml')
final_results = [[i.text, i.find('a')['href']] if i.find('a') else i.text for i in s.find_all('td')]
name = final_results[0]
header = final_results[1:6]
full_results = final_results[6:]

Output:

u'Windows'
[u'Type', u'Issue', u'Restart', u'Severity', u'Impact']
[u'some item', [u'some website', 'some website'], u'Yes', u'Critical', u'stuff', u'some item', [u'some website', 'some website'], u'Yes', u'Important', u'stuff']

The results can be further combined to a dictionary:

table = [dict(zip(header, full_results[i:i+5])) for i in range(0, len(full_results), 5)]

Output:

[{u'Impact': u'stuff', u'Issue': [u'some website', 'some website'], u'Type': u'some item', u'Severity': u'Critical', u'Restart': u'Yes'}, {u'Impact': u'stuff', u'Issue': [u'some website', 'some website'], u'Type': u'some item', u'Severity': u'Important', u'Restart': u'Yes'}]