Python: Text

2024-11-27

Task: find the 20 most frequent words in 20240601.txt and the number of times each one occurs.

Hints:

Exclude any word that appears in stopwords-en.txt.
Add "A)", "B)", "C)", and "D)" to the stopwords.
Convert all words to lowercase.
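
The hints interact with how the text is split into tokens, which a small sketch can make concrete. The sample string and the variable names below are made up for illustration. Because the text is lowercased, the option labels have to be compared in lowercase too, and a pure word-character pattern never yields a token such as "a)" in the first place.

```python
import re

# Hypothetical sample line, not taken from 20240601.txt
sample = "A) Apple B) Banana"
text = sample.lower()

# A word-character pattern drops the ")" and leaves the bare letters behind
word_tokens = re.findall(r"\b\w+\b", text)
print(word_tokens)

# Splitting on whitespace keeps the labels whole, so they can be filtered out
space_tokens = text.split()
labels = {"a)", "b)", "c)", "d)"}  # lowercase, to match the lowercased text
kept = [t for t in space_tokens if t not in labels]
print(kept)
```

With the word-character pattern, the labels only disappear if the bare letters "a" to "d" happen to be in the stopword list themselves.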

Output format:

Each word left-aligned in a field 20 characters wide
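
As a minimal sketch of this format, the `<20` format specification pads a string with spaces on the right up to 20 characters. The words and counts below are invented for illustration and are not results from 20240601.txt.

```python
# Invented (word, count) pairs, only to show the alignment
rows = [("people", 12), ("time", 9)]

# "<20" left-aligns the word in a 20-character field; the count follows it
lines = [f"{word:<20}{count}" for word, count in rows]
for line in lines:
    print(line)
```

Every count then starts in the same column, as long as no word is longer than 20 characters.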

Code implementation

import re
from collections import Counter

# Read the file contents and convert them to lowercase
with open('20240601.txt', 'r', encoding='utf-8') as file:  # replace 20240601.txt with your own file path
    text = file.read().lower()

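# Read the stopword file and add the custom stopwords
stopwords_file = 'stopwords-en.txt'  # replace stopwords-en.txt with your own file path
# Lowercase, because the text is lowercased before matching
custom_stopwords = ['a)', 'b)', 'c)', 'd)']

with open(stopwords_file, 'r', encoding='utf-8') as file:
    # A set gives fast membership tests; strip and lowercase each entry
    stopwords = {line.strip().lower() for line in file if line.strip()}
    stopwords.update(custom_stopwords)

# Extract words with a regular expression. The first alternative keeps option
# labels such as "a)" as single tokens so the custom stopwords can match them;
# \b\w+\b on its own would reduce "a)" to "a".
words = re.findall(r'\b[a-d]\)|\b\w+\b', text)

# Filter out the stopwords
filtered_words = [word for word in words if word not in stopwords]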

# Count how often each word occurs
word_counts = Counter(filtered_words)

# Print the 20 most frequent words and their counts
for word, count in word_counts.most_common(20):
    print(f'{word:<20}{count}')

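Code explanation:

Read the file contents: open reads 20240601.txt, and the text is converted to lowercase.
Read the stopword file and add the custom stopwords: the stopwords are read from stopwords-en.txt, and the option labels "A)", "B)", "C)", "D)" are added to them.
Extract words with a regular expression: re.findall extracts the words from the text; \b\w+\b matches each run of word characters.
Filter out the stopwords: a list comprehension drops every word that is in the stopword collection.
Count how often each word occurs: the Counter class counts the occurrences of each word.
Print the 20 most frequent words and their counts: most_common(20) returns the 20 most frequent words with their counts, and the format string {word:<20}{count} prints each word left-aligned in a 20-character field.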
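
The steps above can also be wrapped in a function and tried on an in-memory string, which makes them easy to check without the two input files. This is only a sketch: `top_words`, its parameters, and the sample sentence are hypothetical names introduced here. It assumes the pattern should also match option labels such as "a)" as whole tokens, so that the custom stopwords have something to match.

```python
import re
from collections import Counter

def top_words(text, stopwords, n=20):
    """Return the n most common tokens in text that are not stopwords."""
    # Normalise the stopwords the same way as the text (assumption of this sketch)
    stop = {w.strip().lower() for w in stopwords if w.strip()}
    # Option labels like "a)" are kept as single tokens; everything else is \w+
    tokens = re.findall(r"\b[a-d]\)|\b\w+\b", text.lower())
    return Counter(t for t in tokens if t not in stop).most_common(n)

# Hypothetical sample text and stopword list
sample = "A) The cat sat. B) The cat ran. C) A dog ran."
result = top_words(sample, ["the", "a", "A)", "B)", "C)", "D)"], n=3)
for word, count in result:
    print(f"{word:<20}{count}")
```

Here "cat" and "ran" each occur twice and fill the top two places, while the labels and the stopwords "the" and "a" are excluded from the count.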

Copyright： Copyright is owned by the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.