dask_ml.feature_extraction.text.CountVectorizer

`dask_ml.feature_extraction.text`.CountVectorizer

class dask_ml.feature_extraction.text.CountVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

Convert a collection of text documents to a matrix of token counts.

See also

sklearn.feature_extraction.text.CountVectorizer

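Notes

When a vocabulary is not provided, fit_transform requires two passes over the dataset: one to learn the vocabulary and a second to transform the data. If the data fits in (distributed) memory, consider persisting it before calling fit or transform without a vocabulary.

Additionally, this implementation benefits from an active dask.distributed.Client, even on a single machine. When a client is present, the learned vocabulary is persisted in distributed memory, which saves some recomputation and redundant communication.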
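To make the two passes concrete, here is a minimal pure-Python sketch of the idea. It is not Dask-ML's implementation: the helper names (`tokenize`, `learn_vocabulary`, `count_tokens`) are hypothetical, the partitions are plain lists standing in for bag partitions, and the tokenizer only approximates the default word analyzer.

```python
import re
from collections import Counter

def tokenize(doc):
    # Simplified stand-in for the default analyzer: lowercase, then keep
    # runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

def learn_vocabulary(partitions):
    # Pass 1: visit every document once to collect the distinct tokens,
    # then assign column indices in sorted order.
    tokens = set()
    for partition in partitions:
        for doc in partition:
            tokens.update(tokenize(doc))
    return {token: index for index, token in enumerate(sorted(tokens))}

def count_tokens(partitions, vocabulary):
    # Pass 2: visit every document again, now that the columns are known.
    rows = []
    for partition in partitions:
        for doc in partition:
            counts = Counter(tokenize(doc))
            rows.append([counts.get(token, 0) for token in vocabulary])
    return rows

# Two lists standing in for a bag with npartitions=2.
partitions = [
    ["This is the first document.", "This document is the second document."],
    ["And this is the third one.", "Is this the first document?"],
]
vocabulary = learn_vocabulary(partitions)
rows = count_tokens(partitions, vocabulary)
```

Because the second pass cannot start until the full vocabulary is known, every partition is read twice. That is why persisting the data, or supplying `vocabulary` up front, avoids recomputing the input.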

Examples

The Dask-ML implementation currently requires that raw_documents is a dask.bag.Bag of documents, where each element of the bag is a string.

>>> from dask_ml.feature_extraction.text import CountVectorizer
>>> import dask.bag as db
>>> from distributed import Client
>>> client = Client()
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> corpus = db.from_sequence(corpus, npartitions=2)
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> X
dask.array<concatenate, shape=(nan, 9), dtype=int64, chunksize=(nan, 9), ...
           chunktype=scipy.csr_matrix>
>>> X.compute().toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])
>>> vectorizer.get_feature_names_out()
array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

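Methods

`build_analyzer`()	Return a callable to process input data.
`build_preprocessor`()	Return a function to preprocess the text before tokenization.
`build_tokenizer`()	Return a function that splits a string into a sequence of tokens.
`decode`(doc)	Decode the input into a string of Unicode symbols.
`fit`(raw_documents[, y])	Learn a vocabulary dictionary of all tokens in the raw documents.
`fit_transform`(raw_documents[, y])	Learn the vocabulary dictionary and return the document-term matrix.
`get_feature_names_out`([input_features])	Get output feature names for the transformation.
`get_metadata_routing`()	Get the metadata routing of this object.
`get_params`([deep])	Get the parameters of this estimator.
`get_stop_words`()	Build or fetch the effective stop words list.
`inverse_transform`(X)	Return the terms per document that have nonzero entries in X.
`set_params`(**params)	Set the parameters of this estimator.
`transform`(raw_documents)	Transform documents to a document-term matrix.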
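
As a rough illustration of what the tokenizer and analyzer do with the defaults in the signature (`lowercase=True`, `token_pattern='(?u)\\b\\w\\w+\\b'`, `ngram_range=(1, 1)`), the sketch below applies the same regular expression with the standard library. The names `TOKEN_PATTERN` and `analyze` are hypothetical, and the sketch ignores `strip_accents`, `stop_words`, and n-grams, so treat it as an approximation rather than the library's code.

```python
import re

# Default token_pattern from the signature: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def analyze(doc, lowercase=True):
    # Optionally lowercase, then extract tokens; punctuation acts as a separator.
    if lowercase:
        doc = doc.lower()
    return TOKEN_PATTERN.findall(doc)

tokens = analyze("This is a first document, isn't it?")
print(tokens)  # ['this', 'is', 'first', 'document', 'isn', 'it']
```

Note that single-character tokens such as "a" and the "t" in "isn't" are dropped, because the pattern requires at least two word characters.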

__init__(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
