Spooky Author Identification¶

Let's look at the competition rules:

https://www.kaggle.com/c/spooky-author-identification

# Dataset
import pandas as pd
train = pd.read_csv("data/spooky-authors/train.zip", index_col=['id'])
test = pd.read_csv("data/spooky-authors/test.zip", index_col=['id'])
sample_submission = pd.read_csv("data/spooky-authors/sample_submission.zip", index_col=['id'])

print(train.shape, test.shape, sample_submission.shape)
print(set(train.columns) - set(test.columns))

(19579, 2) (8392, 1) (8392, 3)
{'author'}

train.head(5)

The authors¶

EAP - Edgar Allan Poe¶

American writer
The Raven
Born January 19, 1809

HPL - Howard Phillips Lovecraft¶

American writer
Cthulhu Mythos
Born August 20, 1890

MWS - Mary Wollstonecraft Shelley¶

English writer
Frankenstein
Born August 30, 1797

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

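# Replace the author codes with first names written in Bulgarian (Edgar, Howard, Mary);
# all labels and outputs below use these values.
train.author = train.author.replace(['EAP', 'HPL', 'MWS'], ['Едгар', 'Хауърд', 'Мери'])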

sns.countplot(data=train, x='author');

all_words = train['text'].str.split(expand=True).unstack().value_counts()
all_words.head(15)

the     33296
of      20851
and     17059
to      12615
I       10382
a       10359
in       8787
was      6440
that     5988
my       5037
had      4324
with     4207
his      3802
as       3528
he       3422
dtype: int64

all_words.tail(15)

penance,         1
creaters         1
waker,           1
sick?            1
antagonist?"     1
oversensitive    1
Mesmerism;       1
Barnard          1
rejects          1
World,           1
spit,            1
coldest          1
chinless         1
ceases           1
preponderate.    1
dtype: int64

The most frequent words are generic function words.

The rarest tokens include proper names.
Many of them also carry attached punctuation, because the text was split on whitespace only.
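Whitespace splitting leaves punctuation glued to tokens (`penance,`, `sick?`). The following is a minimal sketch, assuming a regex equivalent to the default `token_pattern` that `CountVectorizer` reports later in this notebook, of how lowercasing plus regex tokenization merges such variants. The sample sentence and the `count_tokens` helper are illustrative and not part of the original notebook.

```python
import re
from collections import Counter

# Same idea as CountVectorizer's default token_pattern: runs of 2+ word characters.
TOKEN_RE = re.compile(r"\b\w\w+\b")

def count_tokens(texts, lowercase=True):
    """Count tokens across texts after dropping punctuation via the regex."""
    counts = Counter()
    for text in texts:
        if lowercase:
            text = text.lower()
        counts.update(TOKEN_RE.findall(text))
    return counts

sample = ['Was he sick? "Sick," I said.']    # made-up sentence
naive = Counter(" ".join(sample).split())    # whitespace split, as in the cell above
cleaned = count_tokens(sample)
print(naive)
print(cleaned)
```

With whitespace splitting, `sick?` and `"Sick,"` are two different tokens; after regex tokenization both count as `sick`, and the single-character token `I` is dropped.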

eap = train[train.author=="Едгар"].text.values
hpl = train[train.author=="Хауърд"].text.values
mws = train[train.author=="Мери"].text.values

Word cloud¶

from wordcloud import WordCloud, STOPWORDS
from PIL import Image
import numpy as np

def plot_wordcloud_mask(words, img_path):
    img = Image.open(img_path)
    img_mask = np.array(img)

    plt.figure(figsize=(12,8))
    wc = WordCloud(background_color="black", max_words=10000, mask=img_mask,
                   stopwords=STOPWORDS, max_font_size= 40)
    wc.generate(" ".join(words))
    plt.imshow(wc.recolor( colormap= 'Pastel1_r' , random_state=17), alpha=0.98)
    plt.axis('off');

plot_wordcloud_mask(hpl, "data/spooky-authors/hpl.png")

plot_wordcloud_mask(eap, "data/spooky-authors/eap.png")

plot_wordcloud_mask(mws, "data/spooky-authors/mws.png")

Feature ideas:¶

CountVectorizer, Tfidf
Preprocessing - stop words, lemmatization
Other features - number of words, number of stop words, number of punctuation marks, number of CAPITAL letters, etc.
Finding common topics with LDA
Word embeddings with neural networks

First - a baseline model¶

from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score

pipeline = Pipeline([
    ('features', CountVectorizer()),
    ('clf', LinearSVC())
])

cross_val_score(pipeline, train.text, train.author, cv=3, n_jobs=3)

array([ 0.78783701,  0.79635305,  0.79509579])

Let's check what CountVectorizer has learned

pipeline.fit(train.text, train.author)
count_vectorizer = pipeline.steps[0][1]
count_vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

list(count_vectorizer.vocabulary_.items())[:15]

[('this', 22175),
 ('process', 17139),
 ('however', 10784),
 ('afforded', 455),
 ('me', 13678),
 ('no', 14817),
 ('means', 13696),
 ('of', 15145),
 ('ascertaining', 1300),
 ('the', 22085),
 ('dimensions', 6133),
 ('my', 14491),
 ('dungeon', 6898),
 ('as', 1287),
 ('might', 13930)]

How does CountVectorizer work?¶

Its other name is "bag of words".

It is similar to one-hot encoding, but for text.

When fit is called, it builds a vocabulary of all words in the corpus (the dataset) and assigns an index to each unique word.
When transform is called, it takes the text of each row and turns it into a vector that records how many times each vocabulary word occurs.

It can count the words or only mark their presence in binary form.
It can also work with sequences of words, "n-grams".
And it can work at the character level, or with whatever tokenizer we give it.
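The fit/transform steps above can be sketched in plain Python. This is an illustration only, not scikit-learn's implementation: `fit_vocabulary` and `transform` are hypothetical helpers, and the regex is assumed to match the vectorizer's default `token_pattern`.

```python
import re

TOKEN_RE = re.compile(r"\b\w\w+\b")  # assumed equivalent of the default token_pattern

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

def fit_vocabulary(docs):
    """Map every unique token to a column index (alphabetical order)."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(docs, vocabulary, binary=False):
    """Turn each document into a vector of token counts (or 0/1 flags)."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in tokenize(doc):
            if tok in vocabulary:            # words unseen during fit are ignored
                row[vocabulary[tok]] += 1
        if binary:
            row = [min(count, 1) for count in row]
        rows.append(row)
    return rows

docs = ["the cat sat", "the cat saw the dog"]   # toy corpus
vocab = fit_vocabulary(docs)
print(vocab)
print(transform(docs, vocab))
print(transform(docs, vocab, binary=True))
```

The second document contains "the" twice, so its count vector has a 2 in that column, while the binary variant only records presence.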

vectorizer = CountVectorizer()
corpus = [
    "Billions and billions of dollars",
    "A lot of money",
    "We are going to make",
    "We are going ot take care of"
]
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)

[[1 0 2 0 1 0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 1 0 1 1 0 0 0 0]
 [0 1 0 0 0 1 0 1 0 0 0 0 1 1]
 [0 1 0 1 0 1 0 0 0 1 1 1 0 1]]
{'billions': 2, 'and': 0, 'of': 9, 'dollars': 4, 'lot': 6, 'money': 8, 'we': 13, 'are': 1, 'going': 5, 'to': 12, 'make': 7, 'ot': 10, 'take': 11, 'care': 3}

Great, now let's show bigrams as well

vectorizer = CountVectorizer(ngram_range=(1,2))
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)

[[1 1 0 0 2 1 1 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 0 0 0 0 0 0 0 0]
 [0 0 1 1 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1]
 [0 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0 1 0 0 1 1 1 1 0 0 1 1]]
{'billions': 4, 'and': 0, 'of': 17, 'dollars': 9, 'billions and': 5, 'and billions': 1, 'billions of': 6, 'of dollars': 18, 'lot': 13, 'money': 16, 'lot of': 14, 'of money': 19, 'we': 26, 'are': 2, 'going': 10, 'to': 24, 'make': 15, 'we are': 27, 'are going': 3, 'going to': 12, 'to make': 25, 'ot': 20, 'take': 22, 'care': 7, 'going ot': 11, 'ot take': 21, 'take care': 23, 'care of': 8}
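To make the `ngram_range=(1,2)` vocabulary above easier to read, here is a small sketch of how word n-grams can be enumerated from a token list. The `word_ngrams` helper is hypothetical and only mirrors the idea, not the library code.

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

tokens = "billions and billions of dollars".split()
grams = word_ngrams(tokens, 1, 2)
print(grams)
print(sorted(set(grams)))
```

The first document yields 4 distinct unigrams and 4 distinct bigrams, which matches the 8 non-zero columns in the first row of the matrix above.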

That was a short digression to see how CountVectorizer works.

Let's go back to the model and look at its predictions

from sklearn.model_selection import cross_val_predict
prediction = cross_val_predict(pipeline, train.text, train.author, cv=3, n_jobs=3)
prediction

array(['Едгар', 'Едгар', 'Хауърд', ..., 'Едгар', 'Едгар', 'Хауърд'], dtype=object)

Note the magic: I did not use LabelEncoder for the classes.

sklearn is smart enough to handle the categorical data in y on its own.

import itertools
from sklearn.metrics import confusion_matrix, accuracy_score

def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues,
                          figsize=(9, 7)):
    matrix = confusion_matrix(y_true, y_pred)

    if normalize:
        matrix = matrix.astype('float') / matrix.sum(axis=1)[:, np.newaxis]

    plt.figure(figsize=figsize)
    plt.imshow(matrix, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()

    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = matrix.max() / 2.
    for i, j in itertools.product(range(matrix.shape[0]), range(matrix.shape[1])):
        plt.text(j, i, format(matrix[i, j], fmt),
                 horizontalalignment="center",
                 size=int((figsize[0] / 10) * 38),
                 color="white" if matrix[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

authors = pipeline.classes_
print(accuracy_score(train.author, prediction))
plot_confusion_matrix(train.author, prediction, classes=authors)

0.793094642219

plot_confusion_matrix(train.author, prediction, classes=authors, normalize=True)
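With `normalize=True` the function divides every row by its sum, so each row shows what fraction of a true class went to each predicted class. A minimal pure-Python sketch of that step, with made-up counts (the `normalize_rows` helper and the numbers are illustrative only):

```python
def normalize_rows(matrix):
    """Divide each row by its sum so entries become fractions of the true class."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized

# Hypothetical counts: rows are true classes, columns are predicted classes.
counts = [[80, 10, 10],
          [5, 40, 5],
          [10, 10, 30]]
normalized = normalize_rows(counts)
for row in normalized:
    print([round(value, 2) for value in row])
```

The diagonal of the normalized matrix is the per-class recall, which makes classes of different sizes comparable.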

Let's try a random forest (RF)

from sklearn.ensemble import RandomForestClassifier
pipeline = Pipeline([
    ('features', CountVectorizer()),
    ('clf', RandomForestClassifier())
])

cross_val_score(pipeline, train.text, train.author, cv=3, n_jobs=3)

array([ 0.61458333,  0.61615078,  0.60781609])

The competition states that submissions are scored with LogLoss.

Let's see what result we get with this metric.

cross_val_score(pipeline, train.text, train.author, 
                cv=3, n_jobs=3, scoring='neg_log_loss')

array([-1.43459149, -1.54161751, -1.40121559])
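For reference, `neg_log_loss` is the negated multiclass log loss. A small sketch of the metric in plain Python follows; the `log_loss` helper, the clipping constant and the toy predictions are assumptions for illustration, not the scikit-learn implementation.

```python
import math

def log_loss(y_true, probabilities, classes, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, probabilities):
        p = probs[classes.index(label)]
        p = min(max(p, eps), 1 - eps)     # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)

classes = ["EAP", "HPL", "MWS"]
y_true = ["EAP", "MWS"]
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 2
confident_wrong = [[0.01, 0.98, 0.01]] * 2
print(log_loss(y_true, uniform, classes))          # ln(3), about 1.0986
print(log_loss(y_true, confident_wrong, classes))  # -ln(0.01), about 4.6052
```

Always predicting 1/3 for each of the three authors gives a loss of ln(3) ≈ 1.0986, so the random forest scores above (around -1.4 to -1.5) are worse than that uninformed guess, even though its accuracy is about 0.61.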

Now with logistic regression, because LinearSVC has no predict_proba by default.

To add one, decision_function + softmax can be used.
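The decision_function + softmax idea can be sketched as follows. The scores are made up and `softmax` is a hypothetical helper; the resulting numbers sum to 1 but are not calibrated probabilities, so treat this as an approximation.

```python
import math

def softmax(scores):
    """Convert raw decision scores into values that are positive and sum to 1."""
    shift = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical decision_function output for one sample (one score per class).
scores = [1.2, -0.3, 0.1]
probs = softmax(scores)
print(probs, sum(probs))
```

The class with the largest score keeps the largest probability, so the predicted label does not change; only the confidence values are new.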

from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('features', CountVectorizer()),
    ('clf', LogisticRegression())
])

print(cross_val_score(pipeline, train.text, train.author, cv=3, n_jobs=3))
print(cross_val_score(pipeline, train.text, train.author, cv=3, n_jobs=3, 
                      scoring='neg_log_loss'))

# We got slightly better results

[ 0.81449142  0.81673307  0.81348659]
[-0.47678342 -0.4755892  -0.47131898]

OK, we will optimize this model quite a bit.

What other features can we come up with?

There may be signal in:¶

number of "stopwords"
number of punctuation marks
number of capital letters
number of words consisting only of capital letters
number of digits in the text
average word length

These may also introduce noise:¶

number of words in the text
number of unique words in the text
number of characters in the text

explore = train.copy()

# number of words in the text
explore['words'] = explore.text.apply(lambda s: len(str(s).split()))

# number of unique words
explore['unique_words'] = explore.text.apply(lambda s: len(set(str(s).split())))

# number of characters
explore['symbols'] = explore.text.str.len()

# number of unique characters
explore['unique_symbols'] = explore.text.apply(lambda s: len(set(str(s))))

import string

# number of capital letters
explore['capital_letters'] = explore.text.apply(lambda s: sum([str.isupper(c) for c in str(s)]))

# number of words consisting only of capital letters
explore['only_capital_letter_words'] = explore.text.apply(lambda s: sum([str.isupper(w) for w in str(s).split()]))

# average word length
explore['average_word_lenght'] = explore.text.apply(lambda s: np.mean([len(w) for w in str(s).split()]))

# number of digits
explore['digits'] = explore.text.apply(lambda s: sum([str.isdigit(c) for c in str(s)]))

# number of punctuation marks
# NOTE: this is written to `train`, not `explore`, so the column is missing from the `explore` columns printed below
train["punctuation"] = train.text.apply(lambda s: sum([c in string.punctuation for c in str(s)]) )

print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

import nltk
# nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')
print(len(stopwords))
print(stopwords)

153
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']

explore['stop_words'] = explore.text.apply(lambda s: sum(w in stopwords for w in str(s).split()))

explore.head()

I'll create a list to keep the feature names in

print(explore.columns)
features_names = list(set(explore.columns) - {'text', 'author'})

Index(['text', 'author', 'words', 'unique_words', 'symbols', 'unique_symbols',
       'capital_letters', 'only_capital_letter_words', 'average_word_lenght',
       'digits', 'stop_words'],
      dtype='object')

for feature in features_names:
    plt.figure()
    sns.violinplot(x=feature, y="author", data=explore)
    plt.title(feature);

There is not much variation in the feature distributions across authors.

Let's train a model on them anyway and see how it behaves.

from sklearn.ensemble import RandomForestClassifier
cross_val_score(RandomForestClassifier(), explore[features_names], explore.author, cv=3, n_jobs=3)

array([ 0.40977328,  0.39288998,  0.39770115])

cross_val_score(LinearSVC(), explore[features_names], explore.author, cv=3, n_jobs=1)

array([ 0.30867034,  0.41863316,  0.3302682 ])

Not very good.

Let's see what the confusion matrix shows.

predict_from_features = cross_val_predict(RandomForestClassifier(), explore[features_names], explore.author, cv=3, n_jobs=3)
print(accuracy_score(explore.author, predict_from_features))
plot_confusion_matrix(explore.author, predict_from_features, classes=authors, normalize=True)

0.400582256499

Let's look at the distribution of the original classes.

explore.author.value_counts() / len(explore)

Едгар     0.403494
Мери      0.308698
Хауърд    0.287808
Name: author, dtype: float64

Overall, these models are less accurate than always predicting the most frequent class (which would give 0.4035).
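The majority-class baseline used in that comparison is easy to compute. A small sketch with toy labels in roughly the proportions printed above (the `majority_baseline_accuracy` helper and the label list are illustrative):

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)

# Toy labels with roughly the class proportions printed above (40% / 31% / 29%).
labels = ["EAP"] * 40 + ["MWS"] * 31 + ["HPL"] * 29
baseline = majority_baseline_accuracy(labels)
print(baseline)
```

Any model whose cross-validated accuracy is at or below this number has learned nothing useful from its features.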

These features might still be useful in some nonlinear model in combination with other features, but for now we set them aside.

We will clean the text of plurals, tenses and so on.

For this we can use stemming or lemmatization.

In short, stemming:

Removes ~ing, ~ed, ~es, ~ly and other endings from all words.

Lemmatization:

Works with a dictionary and finds the correct base form of each word.
Also takes syntax into account: it knows whether a word is a noun or a verb, e.g. "meeting".
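To illustrate the suffix-stripping idea, here is a deliberately naive toy stemmer. It is not the Porter algorithm used below; the suffix list, the `min_stem` threshold and the `toy_stem` name are all assumptions made for this sketch.

```python
SUFFIXES = ("ing", "ed", "es", "ly", "s")   # illustrative subset, checked in this order

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem:
            return lowered[:-len(suffix)]
    return lowered

words = ["seemed", "perfectly", "means", "meeting", "was"]
stems = [toy_stem(word) for word in words]
print(list(zip(words, stems)))
```

Note that "meeting" becomes "meet" whether it is used as a noun or as a verb, which is exactly the distinction a lemmatizer with part-of-speech information can make and a stemmer cannot.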

from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

stem = PorterStemmer()

explore['stemmed'] = explore.text.apply(lambda t: " ".join([stem.stem(w) for w in t.split()])) 

explore[['stemmed', 'text']].head()

print(explore.text[0])
print()
print(explore.stemmed[0])

This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.

thi process, however, afford me no mean of ascertain the dimens of my dungeon; as I might make it circuit, and return to the point whenc I set out, without be awar of the fact; so perfectli uniform seem the wall.

pipeline = Pipeline([
    ('features', CountVectorizer()),
    ('clf', LinearSVC())
])

cross_val_score(pipeline, explore.stemmed, train.author, cv=3, n_jobs=3)

# Results from the same pipeline using the text column:
# array([ 0.78783701,  0.79635305,  0.79509579])

array([ 0.78477328,  0.78562672,  0.78482759])

The additional features did not work, and neither did stemming.¶

Still left to try:

Optimizing the model with CountVectorizer.
Adding more features from latent spaces (LDA) - topic modeling.
Word embeddings with neural networks.
Stacking classifiers.

For now we will only look at optimizing the model.

There are several variants of `CountVectorizer`, and all of them have a large set of parameters.¶

CountVectorizer
TfidfVectorizer
HashingVectorizer

Some of the main parameters are:¶

analyzer='word'
ngram_range=(1, 1)
token_pattern='(?u)\b\w\w+\b'
max_df=1.0, min_df=1
max_features=None,
lowercase=True
preprocessor=None
tokenizer=None
stop_words=None
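Of these parameters, `min_df` and `max_df` are the least self-explanatory: they filter the vocabulary by document frequency. A minimal sketch, assuming `min_df` is given as an absolute document count and `max_df` as a proportion of documents (the forms used in the search below); `document_frequency` and `filter_vocabulary` are hypothetical helpers.

```python
import re
from collections import Counter

def document_frequency(docs):
    """Count in how many documents each token appears (not total occurrences)."""
    df = Counter()
    for doc in docs:
        df.update(set(re.findall(r"\b\w\w+\b", doc.lower())))
    return df

def filter_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep tokens with df >= min_df (count) and df / n_docs <= max_df (proportion)."""
    n_docs = len(docs)
    df = document_frequency(docs)
    return sorted(tok for tok, count in df.items()
                  if count >= min_df and count / n_docs <= max_df)

corpus = [
    "Billions and billions of dollars",
    "A lot of money",
    "We are going to make",
    "We are going ot take care of",
]
rare_removed = filter_vocabulary(corpus, min_df=2)
both_removed = filter_vocabulary(corpus, min_df=2, max_df=0.5)
print(rare_removed)
print(both_removed)
```

On the toy corpus used earlier, `min_df=2` drops every word that appears in a single document (including "billions", which occurs twice but in one document), and `max_df=0.5` additionally drops "of", which appears in 3 of the 4 documents.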

We want to try different classifiers.¶

The search space becomes huge, so we will get help from `RandomizedSearchCV` in sklearn.

In addition, we need to evaluate with LogLoss rather than Accuracy, because that is what the competition uses and the predicted probabilities matter.

First, let's describe the search parameters for the transformation (CountVectorizer)

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer

params_count_word = {"features__ngram_range": [(1,1), (1,2), (1,3)],
                      "features__analyzer": ['word'],
                      "features__max_df":[1.0, 0.9, 0.8, 0.7, 0.6, 0.5],
                      "features__min_df":[2, 3, 5, 10],
                      "features__lowercase": [False, True],
                      "features__stop_words": [None, stopwords]}

params_count_char = {"features__ngram_range": [(1,4), (1,5), (1,6)],
                      "features__analyzer": ['char'],
                      "features__max_df":[1.0, 0.9, 0.8, 0.7, 0.6, 0.5],
                      "features__min_df":[2, 3, 5, 10],
                      "features__lowercase": [False, True],
                      "features__stop_words": [None, stopwords]}

def report(results, n_top=5):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import log_loss
def random_search():
    params = {
        "clf__C": [0.01, 0.1, 0.3, 1, 3, 10],
        "clf__class_weight": [None, 'balanced']
    }

    params.update(params_count_word)

    pipeline = Pipeline([
        ('features', CountVectorizer()),
        ('clf', LogisticRegression())
    ])

    random_search = RandomizedSearchCV(pipeline, param_distributions=params, 
                                       scoring='neg_log_loss',
                                       n_iter=20, cv=3, n_jobs=4)

    random_search.fit(train.text, train.author)
    report(random_search.cv_results_)

# random_search()

Model with rank: 1
Mean validation score: -0.475 (std: 0.002)
Parameters: {'features__stop_words': None, 'features__ngram_range': (1, 1), 'features__lowercase': True, 'features__analyzer': 'word', 'clf__class_weight': None, 'clf__C': 1}

Model with rank: 2
Mean validation score: -0.482 (std: 0.002)
Parameters: {'features__stop_words': None, 'features__ngram_range': (1, 2), 'features__lowercase': True, 'features__analyzer': 'word', 'clf__class_weight': None, 'clf__C': 1}

Model with rank: 3
Mean validation score: -0.486 (std: 0.001)
Parameters: {'features__stop_words': None, 'features__ngram_range': (1, 1), 'features__lowercase': True, 'features__analyzer': 'word', 'clf__class_weight': 'balanced', 'clf__C': 3}

Model with rank: 4
Mean validation score: -0.508 (std: 0.004)
Parameters: {'features__stop_words': None, 'features__ngram_range': (1, 2), 'features__lowercase': False, 'features__analyzer': 'word', 'clf__class_weight': 'balanced', 'clf__C': 0.3}

Model with rank: 5
Mean validation score: -0.525 (std: 0.004)
Parameters: {'features__stop_words': None, 'features__ngram_range': (1, 3), 'features__lowercase': True, 'features__analyzer': 'word', 'clf__class_weight': 'balanced', 'clf__C': 0.3}
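To get a feeling for how small a sample `n_iter=20` is, the sketch below enumerates the full grid implied by `params_count_word` plus the classifier options, then draws 20 distinct settings at random. It only imitates the sampling idea with the standard library; the `"nltk"` string is a placeholder for the stop-word list, and the seed is arbitrary.

```python
import itertools
import random

# Same option lists as params_count_word plus the classifier options in random_search();
# "nltk" stands in for the stop-word list.
space = {
    "features__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "features__analyzer": ["word"],
    "features__max_df": [1.0, 0.9, 0.8, 0.7, 0.6, 0.5],
    "features__min_df": [2, 3, 5, 10],
    "features__lowercase": [False, True],
    "features__stop_words": [None, "nltk"],
    "clf__C": [0.01, 0.1, 0.3, 1, 3, 10],
    "clf__class_weight": [None, "balanced"],
}

keys = list(space)
grid = [dict(zip(keys, combo)) for combo in itertools.product(*space.values())]
print(len(grid))                     # 3 * 1 * 6 * 4 * 2 * 2 * 6 * 2 = 3456

rng = random.Random(0)               # arbitrary seed for reproducibility
candidates = rng.sample(grid, 20)    # 20 distinct settings, analogous to n_iter=20
print(candidates[0])
```

An exhaustive grid search would need 3456 settings times 3 folds, that is 10368 model fits, while the random search does 60. The price is that the best combination may simply never be sampled, so the reported rankings depend on the random draw.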

The search takes a long time, so for char-grams I will run just one round of training and evaluation with more standard hyperparameter values.

pipeline = Pipeline([
    ('features', CountVectorizer(ngram_range=(3,5), analyzer='char')),
    ('clf', LogisticRegression())
])

print(cross_val_score(pipeline, train.text, train.author, cv=3, n_jobs=3))
print(cross_val_score(pipeline, train.text, train.author, cv=3, n_jobs=3, scoring='neg_log_loss'))

[ 0.81648284  0.8113699   0.82007663]
[-0.57635226 -0.57629693 -0.54682243]
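For readers unfamiliar with `analyzer='char'`, character n-grams are simply all contiguous substrings of the given lengths, spaces and punctuation included. A rough sketch (the `char_ngrams` helper is hypothetical and only approximates the library's behaviour):

```python
import re

def char_ngrams(text, min_n=3, max_n=5, lowercase=True):
    """All contiguous character n-grams, spaces and punctuation included."""
    if lowercase:
        text = text.lower()
    text = re.sub(r"\s\s+", " ", text)      # collapse runs of whitespace
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(text) - n + 1):
            ngrams.append(text[start:start + n])
    return ngrams

grams = char_ngrams("The wall.", 3, 4)
print(grams)
```

Unlike word tokens, these features keep punctuation and word boundaries ("ll.", "e w"), which is why they can capture stylistic habits, at the cost of a much larger vocabulary.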

The LogLoss is worse with chars (accuracy is about the same), so we will not explore this further.

Instead, we will try replacing CountVectorizer with its big brother, Tfidf.

Tfidf = Term frequency-inverse document frequency¶

The idea is to assign weights that reflect the importance of each word or n-gram.
For example, "news" is a fairly common word and can appear in many different contexts.
In contrast, "electroencephalograph" is much rarer and immediately gives a medical context.

TF counts how many times the word occurs in the current text (passage, sentence, document, sample).
IDF reflects in how many documents of the whole training corpus the word occurs: the more documents contain it, the lower its weight.

There is also a formula¶

$$ \operatorname{tfidf}(w,d) = \operatorname{tf}(w,d) \cdot \Big( \log \Big( \frac{n+1}{n_w + 1} \Big) + 1 \Big) $$

where:

$w$ - a specific word
$d$ - the document being transformed
$n$ - number of documents in the training set
$n_w$ - number of documents in which $w$ occurs
$\operatorname{tf}(w,d)$ - number of occurrences of the word $w$ in the document $d$
$\log$ - the natural logarithm

This is the smoothed variant that `TfidfVectorizer` uses by default. By default it also scales every document vector to unit L2 norm, a step the examples here leave out.

For example¶

"extrapolate" occurs in 10 documents of a 1000-document corpus.
"for" occurs in 900 of the 1000.

$$ \text{tfidf("extrapolate", "extrapolate something")} = 1 \cdot (\log(1001 / 11) + 1) = 5.51 $$

$$ \text{tfidf("for", "he went running for something... for his own good")} = 2 \cdot (\log(1001 / 901) + 1) = 2.21 $$

print(1 * (np.log(1001 / 11) + 1))
print(2 * (np.log(1001 / 901) + 1))

5.51085950652
2.21049904341

tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).todense())
print(tfidf.vocabulary_)

[[ 0.39505606  0.          0.79011212  0.          0.39505606  0.          0.
   0.          0.          0.25215917  0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.
   0.64450299  0.          0.64450299  0.41137791  0.          0.          0.
   0.        ]
 [ 0.          0.40104275  0.          0.          0.          0.40104275
   0.          0.50867187  0.          0.          0.          0.
   0.50867187  0.40104275]
 [ 0.          0.34336615  0.          0.43551643  0.          0.34336615
   0.          0.          0.          0.27798449  0.43551643  0.43551643
   0.          0.34336615]]
{'billions': 2, 'and': 0, 'of': 9, 'dollars': 4, 'lot': 6, 'money': 8, 'we': 13, 'are': 1, 'going': 5, 'to': 12, 'make': 7, 'ot': 10, 'take': 11, 'care': 3}

CountVectorizer().fit(corpus).vocabulary_ == TfidfVectorizer().fit(corpus).vocabulary_

True

The comparison above shows that CountVectorizer and TfidfVectorizer find the same vocabulary, or "bag of words".

That is because TfidfVectorizer uses CountVectorizer internally and only adds the idf functionality on top.

print(tfidf.idf_)

[ 1.91629073  1.51082562  1.91629073  1.91629073  1.91629073  1.51082562
  1.91629073  1.91629073  1.91629073  1.22314355  1.91629073  1.91629073
  1.91629073  1.51082562]
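The `idf_` values and the matrix printed above can be reproduced by hand. The sketch below assumes the vectorizer's default settings (smoothed idf, raw counts as tf, L2 row normalization) and the default token pattern; `fit_idf` and `tfidf_row` are hypothetical helpers written for this illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_idf(docs):
    """Smoothed idf: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {tok: math.log((1 + n_docs) / (1 + count)) + 1 for tok, count in df.items()}

def tfidf_row(doc, idf):
    """tf * idf for each known token, then scale the row to unit L2 norm."""
    tf = Counter(tok for tok in tokenize(doc) if tok in idf)
    weights = {tok: count * idf[tok] for tok, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {tok: w / norm for tok, w in weights.items()} if norm else weights

corpus = [
    "Billions and billions of dollars",
    "A lot of money",
    "We are going to make",
    "We are going ot take care of",
]
idf = fit_idf(corpus)
row = tfidf_row(corpus[0], idf)
print(round(idf["of"], 8), round(idf["we"], 8), round(idf["money"], 8))
print({tok: round(w, 8) for tok, w in sorted(row.items())})
```

The three idf values correspond to words found in 3, 2 and 1 of the 4 documents, and the first row gives "billions" twice the weight of "and" and "dollars" because it occurs twice in that sentence.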

def random_search():
    params = {
        "clf__C": [0.01, 0.1, 0.3, 1, 3, 10],
        "clf__class_weight": [None, 'balanced']
    }

    params.update(params_count_word)

    pipeline = Pipeline([
        ('features', TfidfVectorizer()),
        ('clf', LogisticRegression())
    ])

    random_search = RandomizedSearchCV(pipeline, param_distributions=params, 
                                       scoring='neg_log_loss',
                                       n_iter=20, cv=3, n_jobs=4)

    random_search.fit(train.text, train.author)
    report(random_search.cv_results_)

# random_search() # previous best result:  -0.475

Model with rank: 1
Mean validation score: -0.469 (std: 0.005)
Parameters: {'features__stop_words': None, 'features__ngram_range': (1, 2), 'features__min_df': 2, 'features__max_df': 1.0, 'features__lowercase': True, 'features__analyzer': 'word', 'clf__class_weight': 'balanced', 'clf__C': 10}

Model with rank: 2
Mean validation score: -0.471 (std: 0.006)
Parameters: {'features__stop_words': None, 'features__ngram_range': (1, 2), 'features__min_df': 3, 'features__max_df': 0.5, 'features__lowercase': True, 'features__analyzer': 'word', 'clf__class_weight': None, 'clf__C': 10}

Model with rank: 3
Mean validation score: -0.483 (std: 0.008)
Parameters: {'features__stop_words': None, 'features__ngram_range': (1, 2), 'features__min_df': 5, 'features__max_df': 0.8, 'features__lowercase': False, 'features__analyzer': 'word', 'clf__class_weight': 'balanced', 'clf__C': 10}

Model with rank: 4
Mean validation score: -0.495 (std: 0.002)
Parameters: {'features__stop_words': [<the 153 NLTK English stop words printed above>], 'features__ngram_range': (1, 2), 'features__min_df': 2, 'features__max_df': 0.6, 'features__lowercase': True, 'features__analyzer': 'word', 'clf__class_weight': 'balanced', 'clf__C': 10}

Model with rank: 5
Mean validation score: -0.522 (std: 0.005)
Parameters: {'features__stop_words': None, 'features__ngram_range': (1, 3), 'features__min_df': 10, 'features__max_df': 0.5, 'features__lowercase': True, 'features__analyzer': 'word', 'clf__class_weight': 'balanced', 'clf__C': 10}

There is a slight improvement in LogLoss.

Let's also try swapping the classifier for another classic of text classification: Naive Bayes

from sklearn.naive_bayes import MultinomialNB  # needed by the pipeline below

def random_search():
    params = {
        "clf__alpha": [0.01, 0.1, 0.5, 1, 2]
    }

    params.update(params_count_word)

    pipeline = Pipeline([
        ('features', TfidfVectorizer()),
        ('clf', MultinomialNB())
    ])

    random_search = RandomizedSearchCV(pipeline, param_distributions=params, 
                                       scoring='neg_log_loss',
                                       n_iter=20, cv=3, n_jobs=4)

    random_search.fit(train.text, train.author)
    report(random_search.cv_results_)

# random_search()  # Previous best result: -0.469

Model with rank: 1
Mean validation score: -0.423 (std: 0.003)
Parameters: {'features__stop_words': None, 'features__ngram_range': (1, 2), 'features__min_df': 2, 'features__max_df': 0.8, 'features__lowercase': False, 'features__analyzer': 'word', 'clf__alpha': 0.01}

Model with rank: 2
Mean validation score: -0.465 (std: 0.003)
Parameters: {'features__stop_words': None, 'features__ngram_range': (1, 1), 'features__min_df': 3, 'features__max_df': 0.9, 'features__lowercase': True, 'features__analyzer': 'word', 'clf__alpha': 0.01}

Model with rank: 3
Mean validation score: -0.469 (std: 0.004)
Parameters: {'features__stop_words': None, 'features__ngram_range': (1, 3), 'features__min_df': 5, 'features__max_df': 0.9, 'features__lowercase': True, 'features__analyzer': 'word', 'clf__alpha': 0.1}

Model with rank: 4
Mean validation score: -0.495 (std: 0.002)
Parameters: {'features__stop_words': [<the 153 NLTK English stop words printed above>], 'features__ngram_range': (1, 3), 'features__min_df': 5, 'features__max_df': 0.8, 'features__lowercase': False, 'features__analyzer': 'word', 'clf__alpha': 0.1}

Model with rank: 5
Mean validation score: -0.496 (std: 0.004)
Parameters: {'features__stop_words': [<the 153 NLTK English stop words printed above>], 'features__ngram_range': (1, 3), 'features__min_df': 5, 'features__max_df': 0.6, 'features__lowercase': False, 'features__analyzer': 'word', 'clf__alpha': 0.01}

Here the metric improves further.

I want to try it with stemming as well.

It also turns out that the search picks the lowest alpha value offered, so perhaps I should try even lower ones.
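What alpha does can be seen in a tiny multinomial Naive Bayes written from scratch. This is a sketch on made-up three-document data with integer word counts; `fit_nb` and `predict_proba` are hypothetical helpers, whereas the notebook's model is fit on tf-idf weights rather than raw counts.

```python
import math
from collections import Counter, defaultdict

def fit_nb(docs, labels, alpha=1.0):
    """Estimate log priors and alpha-smoothed per-class word log-probabilities."""
    vocab = sorted({w for doc in docs for w in doc.split()})
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    class_counts = Counter(labels)
    model = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        log_prior = math.log(class_counts[label] / len(labels))
        log_likelihood = {
            w: math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model

def predict_proba(model, doc):
    """Posterior over classes for one document; unknown words are skipped."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        scores[label] = log_prior + sum(log_likelihood[w] for w in doc.split()
                                        if w in log_likelihood)
    shift = max(scores.values())
    exps = {label: math.exp(s - shift) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

docs = ["raven nevermore raven", "cthulhu deep sea", "monster creature monster"]
labels = ["EAP", "HPL", "MWS"]
smooth = predict_proba(fit_nb(docs, labels, alpha=1.0), "raven sea")
sharp = predict_proba(fit_nb(docs, labels, alpha=0.01), "raven sea")
print(smooth)
print(sharp)
```

With alpha=1 a word never seen for a class still gets a sizeable probability, so the posterior stays soft; with alpha=0.01 unseen words are penalized heavily and the posterior becomes much sharper. Sharper probabilities help LogLoss when the model is right and hurt badly when it is wrong, which is why the value is worth tuning.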

def random_search():
    params = {
        "clf__alpha": [0.001, 0.005, 0.01, 0.05, 0.1, 0.3]
    }

    params.update(params_count_word)

    pipeline = Pipeline([
        ('features', TfidfVectorizer()),
        ('clf', MultinomialNB())
    ])

    random_search = RandomizedSearchCV(pipeline, param_distributions=params, 
                                       scoring='neg_log_loss',
                                       n_iter=20, cv=3, n_jobs=4)

    random_search.fit(explore.stemmed, train.author)
    report(random_search.cv_results_)
    
# random_search()  # -0.423

Model with rank: 1
Mean validation score: -0.438 (std: 0.002)
Parameters: {'features__stop_words': None, 'features__ngram_range': (1, 2), 'features__min_df': 2, 'features__max_df': 0.6, 'features__lowercase': False, 'features__analyzer': 'word', 'clf__alpha': 0.01}

Model with rank: 2
Mean validation score: -0.443 (std: 0.004)
Parameters: {'features__stop_words': None, 'features__ngram_range': (1, 3), 'features__min_df': 3, 'features__max_df': 0.6, 'features__lowercase': True, 'features__analyzer': 'word', 'clf__alpha': 0.05}

Model with rank: 3
Mean validation score: -0.453 (std: 0.002)
Parameters: {'features__stop_words': None, 'features__ngram_range': (1, 3), 'features__min_df': 2, 'features__max_df': 1.0, 'features__lowercase': False, 'features__analyzer': 'word', 'clf__alpha': 0.01}

Model with rank: 4
Mean validation score: -0.471 (std: 0.003)
Parameters: {'features__stop_words': None, 'features__ngram_range': (1, 2), 'features__min_df': 5, 'features__max_df': 1.0, 'features__lowercase': False, 'features__analyzer': 'word', 'clf__alpha': 0.01}

Model with rank: 5
Mean validation score: -0.472 (std: 0.004)
Parameters: {'features__stop_words': None, 'features__ngram_range': (1, 3), 'features__min_df': 5, 'features__max_df': 0.5, 'features__lowercase': False, 'features__analyzer': 'word', 'clf__alpha': 0.05}

It found roughly the same parameters, but did not quite reach the same score.¶

I will use the following model:

TfIdf + MultinomialNB, without stemming the text.

Mean validation score: -0.423 (std: 0.003)

I will also use the following parameters:

Parameters: {'features__stop_words': None, 'features__ngram_range': (1, 2), 'features__min_df': 2, 'features__max_df': 0.8, 'features__lowercase': False, 'features__analyzer': 'word', 'clf__alpha': 0.01}

A final check of this model's log loss and accuracy:

from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('features', TfidfVectorizer(ngram_range=(1, 2), min_df=2,
                                 max_df=0.8, lowercase=False)),
    ('clf', MultinomialNB(alpha=0.01))
])

print(cross_val_score(pipeline, train.text, train.author, cv=3, n_jobs=3))
print(cross_val_score(pipeline, train.text, train.author, cv=3, n_jobs=3, 
                      scoring='neg_log_loss'))

[ 0.83195466  0.83466135  0.83187739]
[-0.42530307 -0.418245   -0.42500535]
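For reference, the `neg_log_loss` scores above are the negated mean of the negative log of the probability the model assigned to the true class. The following is a minimal pure-Python sketch of that metric, not the scikit-learn implementation; the function name `multiclass_log_loss` and the three probability rows are made up for illustration.

```python
import math

def multiclass_log_loss(true_idx, probs, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for label, row in zip(true_idx, probs):
        # Clip so that a probability of exactly 0 or 1 never reaches log()
        p = min(max(row[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(true_idx)

# Made-up predictions for three sentences over three classes
probs = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.01, 0.01, 0.98],
]

# All three confident predictions are right
confident_right = multiclass_log_loss([0, 1, 2], probs)

# Same predictions, but the last sentence actually belongs to class 0
confident_wrong = multiclass_log_loss([0, 1, 0], probs)

print(confident_right, confident_wrong)
```

A single confident mistake dominates the average, which is why accuracy and log loss can rank models differently.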

Training the model and submitting¶

First, let's see what format the test predictions must be submitted in.

sample_submission = pd.read_csv("data/spooky-authors/sample_submission.zip")
sample_submission.head()

pipeline = pipeline.fit(train.text, train.author)

print(pipeline.predict_proba(test[:10].text))

[[  1.35048736e-02   9.84099382e-01   2.39574433e-03]
 [  9.56490435e-01   1.86160773e-03   4.16479578e-02]
 [  5.02125066e-03   1.81405097e-03   9.93164698e-01]
 [  7.80026971e-01   1.05174165e-03   2.18921288e-01]
 [  5.52830591e-01   7.76716406e-02   3.69497769e-01]
 [  9.36689592e-01   2.95332942e-04   6.30150746e-02]
 [  9.53634752e-01   4.36937890e-03   4.19958690e-02]
 [  4.81490244e-03   9.79569318e-01   1.56157800e-02]
 [  9.94703828e-01   1.07143053e-05   5.28545795e-03]
 [  4.92431648e-01   1.23390572e-01   3.84177780e-01]]

test_predictions = pipeline.predict_proba(test.text)

print(pipeline.classes_)

['Едгар' 'Мери' 'Хауърд']

submit_file = pd.DataFrame(test_predictions, columns=['EAP', 'MWS', 'HPL'], index=test.index)
submit_file.head(10)

submit_file.to_csv("data/spooky-authors/submit_Tfidf_MNB_text.csv")
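The hard-coded `columns=['EAP', 'MWS', 'HPL']` matches the class order printed by `pipeline.classes_`, but it would silently mislabel the columns if that order ever changed. A hedged sketch of deriving the order instead, in plain Python so it runs without pandas: `AUTHOR_CODES` mirrors the `replace` call near the top of the notebook, while `to_submission_rows` is a hypothetical helper, not part of the original code.

```python
# Display name -> Kaggle code, mirroring the replace() call earlier in the notebook
AUTHOR_CODES = {'Едгар': 'EAP', 'Хауърд': 'HPL', 'Мери': 'MWS'}

def to_submission_rows(classes, probabilities, column_order=('EAP', 'HPL', 'MWS')):
    """Reorder each probability row from classifier order to submission order."""
    codes = [AUTHOR_CODES[name] for name in classes]
    positions = [codes.index(column) for column in column_order]
    return [[row[i] for i in positions] for row in probabilities]

# Class order as printed by pipeline.classes_
classes = ['Едгар', 'Мери', 'Хауърд']

# First prediction row from the output above, rounded
first_row = [0.013505, 0.984099, 0.002396]

rows = to_submission_rows(classes, [first_row])
print(rows[0])  # ordered EAP, HPL, MWS
```

With pandas the same idea amounts to building the column list from `pipeline.classes_` and then selecting the columns in the sample submission's order.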

I expect the submission to score somewhere around 0.41 to 0.42.

It may be slightly better, because during cross-validation we trained on roughly 13k rows and validated on roughly 6.5k.

Now the training set is the full data: about 19.5k rows.
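The fold sizes behind that estimate can be checked with a few lines of arithmetic. This is only a sketch: it uses the 19579 training rows reported at the top of the notebook and ignores the off-by-one differences between folds.

```python
n_samples = 19579  # rows in train, from train.shape
n_folds = 3        # cv=3 in cross_val_score

# Each fold holds out about one third of the rows for validation
validation_size = n_samples // n_folds
training_size = n_samples - validation_size

# How much more data the final fit sees compared with one CV fit
growth = n_samples / training_size

print(training_size, validation_size, round(growth, 2))
```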

# Shall we hack the Kaggle ranking?

print(test.text[:5].values)

[ 'Still, as I urged our leaving Ireland with such inquietude and impatience, my father thought it best to yield.'
 'If a fire wanted fanning, it could readily be fanned with a newspaper, and as the government grew weaker, I have no doubt that leather and iron acquired durability in proportion, for, in a very short time, there was not a pair of bellows in all Rotterdam that ever stood in need of a stitch or required the assistance of a hammer.'
 'And when they had broken down the frail door they found only this: two cleanly picked human skeletons on the earthen floor, and a number of singular beetles crawling in the shadowy corners.'
 'While I was thinking how I should possibly manage without them, one actually tumbled out of my head, and, rolling down the steep side of the steeple, lodged in the rain gutter which ran along the eaves of the main building.'
 'I am not sure to what limit his knowledge may extend.']

	text	author
id
id26305	This process, however, afforded me no means of...	EAP
id17569	It never once occurred to me that the fumbling...	HPL
id11008	In his left hand was a gold snuff box, from wh...	EAP
id27763	How lovely is spring As we looked from Windsor...	MWS
id12958	Finding nothing else, not even gold, the Super...	HPL

	stemmed	text
id
id26305	thi process, however, afford me no mean of asc...	This process, however, afforded me no means of...
id17569	It never onc occur to me that the fumbl might ...	It never once occurred to me that the fumbling...
id11008	In hi left hand wa a gold snuff box, from whic...	In his left hand was a gold snuff box, from wh...
id27763	how love is spring As we look from windsor ter...	How lovely is spring As we looked from Windsor...
id12958	find noth else, not even gold, the superintend...	Finding nothing else, not even gold, the Super...

	EAP	MWS	HPL
id
id02310	0.013505	0.984099	0.002396
id24541	0.956490	0.001862	0.041648
id00134	0.005021	0.001814	0.993165
id27757	0.780027	0.001052	0.218921
id04081	0.552831	0.077672	0.369498
id27337	0.936690	0.000295	0.063015
id24265	0.953635	0.004369	0.041996
id25917	0.004815	0.979569	0.015616
id04951	0.994704	0.000011	0.005285
id14549	0.492432	0.123391	0.384178

	text	author	words	unique_words	symbols	unique_symbols	capital_letters	only_capital_letter_words	average_word_lenght	digits	stop_words
id
id26305	This process, however, afforded me no means of...	Едгар	41	35	231	28	3	2	4.658537	0	16
id17569	It never once occurred to me that the fumbling...	Хауърд	14	14	71	22	1	0	4.142857	0	7
id11008	In his left hand was a gold snuff box, from wh...	Едгар	36	32	200	26	1	0	4.583333	0	15
id27763	How lovely is spring As we looked from Windsor...	Мери	34	32	206	30	4	0	5.088235	0	11
id12958	Finding nothing else, not even gold, the Super...	Хауърд	27	25	174	27	2	0	5.481481	0	11

	id	EAP	HPL	MWS
0	id02310	0.403494	0.287808	0.308698
1	id24541	0.403494	0.287808	0.308698
2	id00134	0.403494	0.287808	0.308698
3	id27757	0.403494	0.287808	0.308698
4	id04081	0.403494	0.287808	0.308698
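The sample submission rows above all carry the same three probabilities. As a rough point of comparison for the 0.42 cross-validation score, the sketch below computes the log loss such a constant submission would get if the test labels occurred with exactly those frequencies. That is an assumption, since the test labels are not available, and `expected_log_loss` is a hypothetical helper.

```python
import math

# Constant probabilities copied from the sample submission rows
constant_prediction = {'EAP': 0.403494, 'HPL': 0.287808, 'MWS': 0.308698}

def expected_log_loss(prediction, label_frequencies):
    """Expected log loss of one fixed probability vector under given label frequencies."""
    return -sum(
        frequency * math.log(prediction[label])
        for label, frequency in label_frequencies.items()
    )

# Assumption: test labels follow the same frequencies as the constant prediction
baseline = expected_log_loss(constant_prediction, constant_prediction)

# Predicting 1/3 for every class gives ln(3) regardless of the label frequencies
uniform = math.log(3)

print(baseline, uniform)
```

Under that assumption the constant submission lands at roughly 1.09, just under the uniform guess, so a score near 0.42 is a large improvement over either.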