Remove part of a string with regex

The string:

<th class="tableright">&pound;1.95&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</th> <th class="tableright">&pound;2.95&nbsp;&nbsp;</th>

How do I remove everything except 1.95?

Here is what I have:

elif 'th' in line and line.islower():
    d[alpha].append(re.sub(r'\s*<.*?>\D*\s*', '', line))

This produces the output below (see the entries after MANGO). In similar lines with a single cell (for example, <th class="tableright">&pound;5.95</th>) the text is removed and only the digits remain, but in the two-cell line the &nbsp; entities survive.

'GREAT ESCAPE', '3.95', 'MELON REFRESHER', '3.95', 'MIXED', '5.95', 'MANGO', '1.95&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2.95&nbsp;&nbsp;', 'LYCHEE', '1.95&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2.95&nbsp;&nbsp;'
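
The difference between the two cases can be reproduced in isolation. This is a minimal sketch, assuming the raw lines contain plain &pound; and &nbsp; entities; the sample strings and variable names are illustrative. The pattern can only start a match at a "<", so entities that sit between a price and the closing tag are never consumed.

```python
import re

# The pattern from the question, written as a raw string.
pattern = re.compile(r'\s*<.*?>\D*\s*')

# Illustrative inputs: one cell without trailing entities, two cells with them.
single = '<th class="tableright">&pound;5.95</th>'
double = ('<th class="tableright">&pound;1.95&nbsp;&nbsp;</th> '
          '<th class="tableright">&pound;2.95&nbsp;</th>')

cleaned_single = pattern.sub('', single)
cleaned_double = pattern.sub('', double)

print(cleaned_single)  # 5.95
print(cleaned_double)  # 1.95&nbsp;&nbsp;2.95&nbsp;
```

In the second line, \D* after the opening tag swallows "&pound;", but nothing in the pattern can begin at "&nbsp;", so those runs stay in the result.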

lxml and XPath.

vvn_black ★★★★★
(01.02.19 17:51:45 MSK)

Some people, when confronted with a problem, think «I know, I’ll use regular expressions.» Now they have two problems. (c)

Furthermore, a non-regular syntax (context-free, in this case) cannot be parsed with regular expressions. Use any available DOM parser and take the values from there.
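
A rough sketch of the parser route, using the standard library's html.parser in place of a real DOM library to avoid a third-party dependency. It is an event-based parser, not a DOM, and the names ThTextCollector and th_prices are hypothetical helpers invented for this illustration.

```python
from html.parser import HTMLParser


class ThTextCollector(HTMLParser):
    """Collects the text content of every <th> cell (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &pound; and &nbsp;.
        super().__init__(convert_charrefs=True)
        self.cells = []
        self._in_th = False

    def handle_starttag(self, tag, attrs):
        if tag == 'th':
            self._in_th = True
            self.cells.append('')

    def handle_endtag(self, tag):
        if tag == 'th':
            self._in_th = False

    def handle_data(self, data):
        # Text outside <th> cells (e.g. the space between cells) is ignored.
        if self._in_th:
            self.cells[-1] += data


def th_prices(line):
    """Return the price text of each <th> cell in the given HTML line."""
    parser = ThTextCollector()
    parser.feed(line)
    parser.close()
    # Drop the pound sign, then strip the non-breaking spaces around the price.
    return [cell.replace('\u00a3', '').strip() for cell in parser.cells]


sample = ('<th class="tableright">&pound;1.95&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</th> '
          '<th class="tableright">&pound;2.95&nbsp;&nbsp;</th>')
print(th_prices(sample))  # ['1.95', '2.95']
```

Because the entities are decoded by the parser, the cleanup reduces to ordinary string methods, and the first price is simply th_prices(sample)[0].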

beastie ★★★★★
(01.02.19 17:55:28 MSK)
Last edited: beastie 01.02.19 17:59:39 MSK (1 edit in total)

If you just want a quick hack:

>>> line1
'<th class="tableright">&pound;1.95&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</th> <th class="tableright">&pound;2.95&nbsp;&nbsp;</th>'
>>> line2
'<th class="tableright">&pound;5.95</th>'
>>> import re
>>> pattern = re.compile(r'pound;[0-9]+\.[0-9]+')
>>> re.search(pattern, line1).group(0).split(';')[-1]
'1.95'
>>> re.search(pattern, line2).group(0).split(';')[-1]
'5.95'
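
A variation on the same hack, sketched under the assumption that every price directly follows "pound;". A capture group replaces the split(';') step, and findall returns every price in the line instead of only the first. The name PRICE_AFTER_POUND is illustrative.

```python
import re

# Capture only the digits that follow "pound;". The ampersand is left out of
# the pattern on purpose, so it does not matter how the entity is written.
PRICE_AFTER_POUND = re.compile(r'pound;([0-9]+\.[0-9]+)')

line1 = ('<th class="tableright">&pound;1.95&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</th> '
         '<th class="tableright">&pound;2.95&nbsp;&nbsp;</th>')
line2 = '<th class="tableright">&pound;5.95</th>'

prices1 = PRICE_AFTER_POUND.findall(line1)
prices2 = PRICE_AFTER_POUND.findall(line2)

print(prices1)  # ['1.95', '2.95']
print(prices2)  # ['5.95']
```

Taking prices1[0] gives the same '1.95' as the search-and-split version above.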

Yorween ★
(01.02.19 18:03:58 MSK)

Why remove anything? Do the opposite and pull out just the numbers.

>>> import re
>>> html = '<th class="tableright">&pound;1.95&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</th> <th class="tableright">&pound;2.95&nbsp;&nbsp;</th>'
>>> re.findall('[0-9]+\\.[0-9][0-9]', html)
['1.95', '2.95']
>>> list(map(float, re.findall('[0-9]+\\.[0-9][0-9]', html)))
[1.95, 2.95]

If that stops being enough, you are better off taking lxml and parsing the HTML properly.

slovazap ★★★★★
(01.02.19 18:22:53 MSK)
Last edited: slovazap 01.02.19 18:24:41 MSK (3 edits in total)