In the previous post we introduced the open-source release of the toolkit and the general idea behind the project. In this post I would like to share the clustering implementation.
At this point we have implemented three clustering algorithms, each wired into the toolkit as a pipeline step (see the sketch after the list):
- K-means
- DBSCAN
- Hierarchical clustering
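Each of these is a small class with an execute method that takes a DataFrame and returns a DataFrame. The real Step base class ships with the toolkit itself; the minimal sketch below only shows the interface the snippets in this post assume (everything apart from the Step name and the execute signature is my assumption):

```python
# minimal sketch of the assumed Step interface; the actual base class
# comes from the toolkit and may carry more functionality
from abc import ABC, abstractmethod
import pandas as pd

class Step(ABC):
    @abstractmethod
    def execute(self, df: pd.DataFrame) -> pd.DataFrame:
        """Transform the incoming DataFrame and return the result."""
        ...
```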
K-means
A very straightforward algorithm:
```python
# clustering algorithms
import inspect
from pprint import pprint

from sklearn.cluster import KMeans

# `Step` and `settings` are provided by the toolkit

class KMeansAlgorithm(Step):
    def __init__(self):
        self.params = settings["clustering_settings"]["kmeans_params"]
        self.newColumn = settings["clustering_settings"]["target_column"]

    def execute(self, df):
        # debug: log which step/method is running
        pprint(self.__class__.__name__)
        pprint(inspect.stack()[0][3])

        # fit K-means with the configured parameters and attach
        # the cluster labels to the frame as a new column
        km = KMeans(**self.params)
        km.fit(df)
        clusters = km.labels_.tolist()
        df[self.newColumn] = clusters

        pprint(df.head(settings["rows_to_debug"]))
        return df
```
K-means is memory-friendly and produces good results.
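Both clustering steps read their parameters from the toolkit's settings dictionary. The exact configuration lives in the project itself; judging from the keys used in the code, it has roughly this shape (the concrete values here are placeholders, not the project's defaults):

```python
# hypothetical settings layout, inferred from the keys used in the steps;
# the parameter values are placeholders and depend on your data
settings = {
    "rows_to_debug": 10,
    "clustering_settings": {
        "target_column": "cluster",
        "kmeans_params": {"n_clusters": 5, "random_state": 42},
        "dbscan_params": {"eps": 0.5, "min_samples": 5},
    },
}
```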
DBSCAN
Although DBSCAN is primarily a noise-reduction (density-based) algorithm, it is capable of self-organising the data into clusters.
```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

class DBScanAlgorithm(Step):
    def __init__(self):
        self.params = settings["clustering_settings"]["dbscan_params"]
        self.newColumn = settings["clustering_settings"]["target_column"]

    def execute(self, df):
        pprint(self.__class__.__name__)
        pprint(inspect.stack()[0][3])

        # DBSCAN is sensitive to feature scale, so fit on a standardised copy;
        # note that fit_transform returns a plain NumPy array
        loc_df = StandardScaler().fit_transform(df)
        db = DBSCAN(**self.params).fit(loc_df)

        # mark core samples (useful for later inspection or plotting)
        core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
        core_samples_mask[db.core_sample_indices_] = True

        clusters = db.labels_.tolist()
        print(clusters)

        # attach the labels to the original DataFrame (the scaled array
        # has no named columns), then return it like the other steps
        df[self.newColumn] = clusters
        pprint(df.head(settings["rows_to_debug"]))
        return df
```
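To try either step outside a full pipeline run, a minimal, self-contained example could look like this (the toy data, the column names, and the assumption that target_column is "cluster" are mine, not part of the toolkit):

```python
import pandas as pd
from sklearn.datasets import make_blobs

# toy data: 200 points with 4 features and 3 natural blobs
X, _ = make_blobs(n_samples=200, n_features=4, centers=3, random_state=0)
df = pd.DataFrame(X, columns=["f1", "f2", "f3", "f4"])

step = KMeansAlgorithm()            # or DBScanAlgorithm()
clustered = step.execute(df)

# assumes target_column is set to "cluster" in the settings
print(clustered["cluster"].value_counts())
```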