In the previous post we announced the open-source release of the toolkit and described the general idea behind the project. Now I would like to share the clustering implementation.
So far we have implemented three clustering algorithms:
- K-means
- DBSCAN
- Hierarchical clustering
K-means
A very straightforward algorithm: fit the model, then store each row's cluster label in a new column.
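# Step and settings are defined elsewhere in the toolkit.
import inspect
from pprint import pprint

from sklearn.cluster import KMeans


class KMeansAlgorithm(Step):
    def __init__(self):
        self.params = settings["clustering_settings"]["kmeans_params"]
        self.newColumn = settings["clustering_settings"]["target_column"]

    def execute(self, df):
        # Debug output: step name and current method name
        pprint(self.__class__.__name__)
        pprint(inspect.stack()[0][3])
        # Fit K-means and store each row's cluster label in the target column
        km = KMeans(**self.params)
        km.fit(df)
        clusters = km.labels_.tolist()
        df[self.newColumn] = clusters
        pprint(df.head(settings["rows_to_debug"]))
        return df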
K-means is memory-friendly and produces good results.
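To try the same idea outside the toolkit, here is a minimal, self-contained sketch. The `settings` dictionary, its values, the `add_kmeans_labels` helper and the sample data are all assumptions made for illustration; only the key names mirror the step above.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical settings, shaped like the ones the step reads.
settings = {
    "clustering_settings": {
        "kmeans_params": {"n_clusters": 2, "n_init": 10, "random_state": 0},
        "target_column": "cluster",
    }
}


def add_kmeans_labels(df, settings):
    """Fit K-means on df and return a copy with a cluster label column."""
    cfg = settings["clustering_settings"]
    km = KMeans(**cfg["kmeans_params"])
    labels = km.fit_predict(df)
    out = df.copy()  # leave the caller's frame untouched
    out[cfg["target_column"]] = labels
    return out


# Two well-separated groups of points (made-up data).
points = pd.DataFrame({
    "x": [1.0, 1.2, 0.8, 8.0, 8.2, 7.9],
    "y": [1.1, 0.9, 1.0, 8.1, 7.9, 8.0],
})
labelled = add_kmeans_labels(points, settings)
print(labelled)
```

Unlike the step above, this sketch works on a copy, so the input frame does not gain the extra column. Which cluster gets label 0 and which gets label 1 depends on the initialisation, so only the grouping itself should be relied on.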
DBSCAN
Although DBSCAN is a density-based algorithm that treats sparse points as noise, it is able to organise the data into clusters by itself, without being told how many clusters to look for.
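import inspect
from pprint import pprint

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler


class DBScanAlgorithm(Step):
    def __init__(self):
        self.params = settings["clustering_settings"]["dbscan_params"]
        self.newColumn = settings["clustering_settings"]["target_column"]

    def execute(self, df):
        pprint(self.__class__.__name__)
        pprint(inspect.stack()[0][3])
        # Scale features first: eps is a distance, so it is sensitive to units
        loc_df = StandardScaler().fit_transform(df)
        db = DBSCAN(**self.params).fit(loc_df)
        # Boolean mask of core samples (not used further in this step)
        core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
        core_samples_mask[db.core_sample_indices_] = True
        clusters = db.labels_.tolist()  # -1 marks noise points
        print(clusters)
        # fit_transform returns a numpy array; rebuild a DataFrame, keeping the index
        if not isinstance(loc_df, pd.DataFrame):
            loc_df = pd.DataFrame(loc_df, columns=df.columns, index=df.index)
        loc_df[self.newColumn] = clusters
        # Show the frame that is actually returned (scaled values plus labels)
        pprint(loc_df.head(settings["rows_to_debug"]))
        return loc_df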
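To see how DBSCAN separates clusters from noise, the following is a small standalone sketch rather than the toolkit's own code. The `dbscan_params` values and the sample points are assumptions chosen for illustration; `eps` is expressed in scaled units because the data goes through `StandardScaler` first, as in the step above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Hypothetical parameters; eps is a distance in scaled feature space.
dbscan_params = {"eps": 0.5, "min_samples": 3}

# Two tight groups plus one far-away outlier (made-up data).
points = pd.DataFrame({
    "x": [1.0, 1.1, 0.9, 1.0, 5.0, 5.1, 4.9, 5.0, 20.0],
    "y": [1.0, 0.9, 1.1, 1.0, 5.0, 5.1, 4.9, 5.0, 20.0],
})

scaled = StandardScaler().fit_transform(points)
db = DBSCAN(**dbscan_params).fit(scaled)
labels = db.labels_

# Core samples are points with at least min_samples neighbours within eps.
core_mask = np.zeros_like(labels, dtype=bool)
core_mask[db.core_sample_indices_] = True

# Noise points get the label -1 and do not count as a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())

result = points.copy()
result["cluster"] = labels
result["is_core"] = core_mask
print(result)
print("clusters:", n_clusters, "noise points:", n_noise)
```

The number of clusters is never passed in: it falls out of `eps` and `min_samples`. The isolated point ends up with label -1, so downstream steps that expect only valid cluster ids may want to filter or handle that value explicitly.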