The Data Revolution: Transitioning from Warehouses to Lakehouses for Enhanced Analytics

Analytics

The evolution of data analytics platforms has seen a significant shift from traditional data warehouses to modern data lakehouses, driven by the need for more flexible and scalable data management solutions.

The Shift in Data Management

Historically, organizations relied heavily on data warehouses for structured data analysis. These systems excelled at executing specific queries, particularly in business intelligence (BI) and reporting environments. However, as data volumes grew and diversified—encompassing structured, semi-structured, and unstructured data—the limitations of traditional data warehouses became apparent.

In the mid-2000s, businesses began to recognize the potential of harnessing vast amounts of data from various sources for analytics and monetization. This led to the emergence of the “data lake,” designed to store raw data without enforcing strict quality controls. While data lakes provided a solution for storing diverse data types, they fell short in terms of data governance and transactional capabilities.

The Role of Object Storage

The introduction of object storage, particularly with the standardization of the S3 API, has transformed the landscape of data analytics. Object storage allows organizations to store a wide array of data types efficiently, making it an ideal foundation for modern analytics platforms.

Today, many analytics solutions, such as Greenplum, Vertica, and SQL Server 2022, have integrated support for object storage through the S3 API. This integration enables organizations to utilize object storage not just for backups but as a primary data repository, facilitating a more comprehensive approach to data analytics.

The Benefits of Data Lakehouses

The modern data lakehouse architecture combines the best features of data lakes and data warehouses. It allows for the decoupling of storage and compute resources, supporting a variety of analytical workloads. This flexibility means that organizations can access and analyze their entire data set efficiently using standard S3 API calls.
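
As a rough illustration (my own sketch, using the AWS SDK for Java v2 with a hypothetical bucket and prefix; the same S3 API calls work against any S3-compatible object store), listing the objects that back an analytics table could look like this:

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;

public class LakehouseListing {
    public static void main(String[] args) {
        // Credentials come from the default provider chain; any S3-compatible
        // endpoint can be addressed the same way
        try (S3Client s3 = S3Client.create()) {
            ListObjectsV2Request request = ListObjectsV2Request.builder()
                    .bucket("analytics-lakehouse")   // hypothetical bucket name
                    .prefix("sales/2024/")           // hypothetical table prefix
                    .build();

            // Print each object that backs the table, with its size
            s3.listObjectsV2(request).contents()
              .forEach(obj -> System.out.println(obj.key() + " (" + obj.size() + " bytes)"));
        }
    }
}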

Key Advantages:

  • Scalability: Object storage can grow with the organization’s data needs without the constraints of traditional storage solutions.
  • Versatility: Supports diverse data types and analytics use cases, making it suitable for various business applications.
  • Cost-Effectiveness: Provides a more affordable storage solution, particularly for large volumes of data.

Conclusion

The evolution from data warehouses to data lakehouses represents a significant advancement in data analytics capabilities. By leveraging object storage and the S3 API, organizations can now manage their data more effectively, enabling deeper insights and better decision-making. For more detailed insights and use cases, explore Cloudian’s resources on hybrid cloud storage for data analytics.

Cyber Whale is a Moldovan agency specializing in building custom Business Intelligence (BI) systems that empower businesses with data-driven insights and strategic growth.

Let us help you with our BI systems. Reach out at [email protected]

Boost Your Java Skills: Must-Know Code Hacks

Java is one of the most versatile programming languages, and mastering its essential code techniques can significantly streamline your development process.

Java

Whether you’re a seasoned developer or just starting out, having a cheat sheet with key Java snippets at your fingertips can boost efficiency and help tackle complex challenges faster.

Below, we break down some must-know code categories that every developer should keep in their toolkit.

1. Mastering Loops for Iteration

Loops are essential for efficiently processing data in Java. Let’s start with the for loop, which allows iteration over a range of values.

// Simple iteration from 0 to 9
for (int index = 0; index < 10; index++) {
    System.out.print(index);
}
// Output: 0123456789

In cases where you need to manipulate multiple variables within a loop, you can do something like this:

// Iteration with two variables in the loop
for (int i = 0, j = 0; i < 3; i++, j--) {
    System.out.print(j + "|" + i + " ");
}
// Output: 0|0 -1|1 -2|2

This combination of variables and conditions inside a loop can help optimize complex logic in a more readable and concise manner.
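
For completeness (a small addition of mine, since the conclusions below also mention while and do-while), the same iteration can be written with a while loop:

// Equivalent iteration using a while loop
int index = 0;
while (index < 10) {
    System.out.print(index);
    index++;
}
// Output: 0123456789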

2. Working with Lists for Flexible Data Storage

Lists are a common data structure in Java, allowing you to store and retrieve elements dynamically. Here’s how you can use ArrayList:

// Initialize a list of integers
List<Integer> numbers = new ArrayList<>();

// Adding elements to the list
numbers.add(2);
numbers.add(5);
numbers.add(8);

// Accessing the first element
System.out.println(numbers.get(0)); // Output: 2

// Iterating through the list using an index
for (int i = 0; i < numbers.size(); i++) {
    System.out.println(numbers.get(i));
}

// Removing elements from the list
numbers.remove(numbers.size() - 1); // Removes the last element
numbers.remove(0); // Removes the first element

// Iterating through the modified list
for (Integer num : numbers) {
    System.out.println(num); // Output: 5
}

This example illustrates how to add, access, and remove elements from an ArrayList.

3. Using Deques for Flexible Data Handling

Java’s Deque (Double-Ended Queue) provides flexible data manipulation. Here’s an adapted example:

// Create a deque for strings
Deque<String> animals = new ArrayDeque<>();

// Add elements to the deque
animals.add("Dog");
animals.addFirst("Cat");
animals.addLast("Horse");

// Display the contents of the deque
System.out.println(animals); // Output: [Cat, Dog, Horse]

// Peek at the first element without removal
System.out.println(animals.peek()); // Output: Cat

// Remove and return the first element
System.out.println(animals.pop()); // Output: Cat
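
The snippet above removes only from the head; assuming the same animals deque, an element can be removed from the tail as well:

// Remove and return the last element
System.out.println(animals.pollLast()); // Output: Horse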

This example demonstrates how to manipulate elements at both ends of the Deque.

4. Mathematical Operations with the Math Class

Java’s Math class offers a range of mathematical functions. Here’s a set of handy examples:

// Find the maximum and minimum of two numbers
System.out.println(Math.max(8, 15));   // Output: 15
System.out.println(Math.min(8, 15));   // Output: 8

// Calculate the absolute value and square root
System.out.println(Math.abs(-7));      // Output: 7
System.out.println(Math.sqrt(25));     // Output: 5.0

// Calculate power and round to the nearest integer
System.out.println(Math.pow(3, 4));    // Output: 81.0
System.out.println(Math.round(5.7));   // Output: 6

// Perform trigonometric functions using radians
System.out.println(Math.sin(Math.toRadians(45)));  // Output: ≈0.7071
System.out.println(Math.cos(Math.toRadians(45)));  // Output: ≈0.7071

These examples highlight how to perform various mathematical operations and conversions with Java’s Math class.

Conclusions

Mastering essential Java code techniques is crucial for elevating your development skills and improving code efficiency. By delving into various coding practices, you can refine your programming abilities and produce more robust applications. Here’s a breakdown of key areas where Java techniques can make a significant impact:

  1. Handling Loops and Iterations:
    • Efficient loop handling is fundamental for processing data and automating repetitive tasks. By understanding different types of loops, such as for, while, and do-while, you can optimize your code for performance and readability.
    • Example: Utilize the for loop to iterate through a range of values or a collection, ensuring minimal computational overhead.
  2. Managing Collections:
    • The Java Collections Framework provides versatile data structures like ArrayList, HashMap, and Deque for managing groups of objects. Mastering these collections allows you to efficiently store, retrieve, and manipulate data.
    • Example: Use ArrayList for dynamic arrays where elements can be added or removed, and HashMap for key-value pairs to quickly access data based on a unique key (see the sketch after this list).
  3. Performing Complex Mathematical Calculations:
    • Java offers a suite of mathematical functions and constants via the Math class, such as Math.max(), Math.sqrt(), and Math.pow(). Leveraging these functions helps in performing accurate and efficient calculations.
    • Example: Calculate the square root of a number using Math.sqrt() or find the power of a number using Math.pow() for precise mathematical operations.
  4. Optimizing Code with Advanced Techniques:
    • Advanced techniques like multi-threading, generics, and exception handling play a critical role in writing efficient and error-free code. By understanding and implementing these techniques, you can handle complex scenarios and improve application performance.
    • Example: Use multi-threading to perform parallel tasks, generics for type-safe collections, and exception handling to manage errors gracefully.
  5. Applying Best Practices for Cleaner Code:
    • Adhering to best practices like SOLID principles, clean code guidelines, and proper logging ensures that your code is maintainable, scalable, and easy to understand.
    • Example: Follow SOLID principles to design robust and flexible object-oriented systems, and use logging libraries to track application behavior and troubleshoot issues.
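
As a quick illustration of the collections and error-handling points above, here is a minimal sketch of my own (not part of the original article) combining a generic HashMap with basic exception handling:

import java.util.HashMap;
import java.util.Map;

public class InventoryLookup {
    public static void main(String[] args) {
        // Generics make the map type-safe: String keys, Integer values
        Map<String, Integer> stock = new HashMap<>();
        stock.put("keyboard", 12);
        stock.put("monitor", 5);

        // getOrDefault avoids a null result for missing keys
        int units = stock.getOrDefault("mouse", 0);
        System.out.println("Units in stock: " + units); // Output: Units in stock: 0

        try {
            // Parsing user input is a typical place where exceptions appear
            int requested = Integer.parseInt("three");
            System.out.println("Requested: " + requested);
        } catch (NumberFormatException e) {
            // Handle the error gracefully instead of crashing
            System.out.println("Invalid quantity: " + e.getMessage());
        }
    }
}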

These Java code techniques will not only enhance your development skills but also make you a more effective and confident Java developer. By applying these practices, you can streamline your coding process, tackle various programming challenges with ease, and contribute to the creation of high-quality software solutions. Keep these techniques in mind as you continue to grow and excel in your Java programming journey.

Let us develop your Java application!

Let us know at [email protected]

Mastering Java: Essential Code Techniques for Modern Development

Java

Java Roadmap

Mastering Java requires a step-by-step approach, moving from the basics to advanced topics. Here’s a streamlined roadmap to guide your journey:

1. Setup and Tools

  • Linux: Learn basic commands.
  • Git: Master version control for collaboration.
  • IDEs: Familiarize yourself with IntelliJ IDEA, Eclipse, or VSCode.

2. Core Java Concepts

  • OOP: Understand classes, objects, inheritance, and polymorphism.
  • Arrays & Strings: Work with data structures and string manipulation.
  • Loops: Control flow with for, while, and do-while.
  • Interfaces & Packages: Organize and structure code.

3. File I/O and Collections

  • File Handling: Learn file operations using I/O Streams.
  • Collections Framework: Work with Lists, Maps, Stacks, and Queues.
  • Optionals: Avoid null pointer exceptions with Optional.
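
As a small illustrative sketch of my own (not part of the original roadmap), Optional can replace explicit null checks like this:

import java.util.Optional;

public class OptionalDemo {
    public static void main(String[] args) {
        // A lookup that may or may not produce a value
        Optional<String> nickname = findNickname("alice");

        // No null checks: supply a fallback instead
        System.out.println(nickname.orElse("no nickname"));

        // Or run code only when a value is present
        nickname.ifPresent(n -> System.out.println("Hello, " + n));
    }

    // Hypothetical lookup returning Optional instead of null
    static Optional<String> findNickname(String user) {
        return "alice".equals(user) ? Optional.of("Ally") : Optional.empty();
    }
}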

4. Advanced Java Concepts

  • Dependency Injection: Understand DI patterns.
  • Design Patterns: Learn common patterns like Singleton and Factory.
  • JVM Internals: Learn memory management and garbage collection.
  • Multi-Threading: Handle concurrency and threads.
  • Generics & Exception Handling: Write type-safe code and handle errors gracefully.
  • Streams: Work with functional programming using Streams.
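
A short sketch of my own showing the functional style that the Streams API enables (filter, map, collect):

import java.util.List;
import java.util.stream.Collectors;

public class StreamsDemo {
    public static void main(String[] args) {
        List<Integer> numbers = List.of(1, 2, 3, 4, 5, 6);

        // Keep even numbers, square them, and collect the results
        List<Integer> evenSquares = numbers.stream()
                .filter(n -> n % 2 == 0)
                .map(n -> n * n)
                .collect(Collectors.toList());

        System.out.println(evenSquares); // Output: [4, 16, 36]
    }
}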

5. Testing and Debugging

  • Unit & Integration Testing: Use JUnit/TestNG for testing.
  • Debugging: Learn debugging techniques.
  • Mocking: Use libraries like Mockito for test isolation.

6. Databases

  • Database Design: Learn to design schemas and write efficient queries.
  • SQL & NoSQL: Work with relational (JDBC, sketched after this list) and non-relational databases.
  • Schema Migration Tools: Use Flyway or Liquibase for migrations.
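
As a rough sketch of the JDBC item above (the connection URL, credentials, and table are placeholders of mine), a parameterized query might look like this:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class JdbcDemo {
    public static void main(String[] args) throws SQLException {
        // Placeholder connection settings; adjust for your own database
        String url = "jdbc:postgresql://localhost:5432/appdb";

        try (Connection conn = DriverManager.getConnection(url, "app_user", "secret");
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT name FROM customers WHERE country = ?")) {
            // Bind the parameter instead of concatenating strings (avoids SQL injection)
            stmt.setString(1, "MD");
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("name"));
                }
            }
        }
    }
}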

7. Clean Code Practices

  • SOLID Principles: Write maintainable and scalable code.
  • Immutability: Ensure thread-safe and predictable objects.
  • Logging: Implement effective logging for debugging.

8. Build Tools

  • Learn to use Maven, Gradle, or Bazel for project builds.

9. HTTP and APIs

  • HTTP Protocol & REST API: Design scalable APIs.
  • GraphQL: Explore efficient querying with GraphQL.

10. Frameworks

  • Spring Boot: Build production-ready applications.
  • Play & Quarkus: Learn lightweight, cloud-native frameworks.

Let us develop your Java application!

Let us know at [email protected]

Head of data – Job

Job

Job Description

Because existing SaaS products do not satisfy most specific needs, we need to bring a new kind of CDP to market to empower data management.

Requirements:

  • Experience with ETL and data pipelines.
  • Knowledge of SQL.
  • Knowledge of GenAI and LLMs, plus basic MLOps skills to deploy LLMs.
  • At least basic Python and JavaScript.
  • English level – B1+.
  • Experience with Docker, GitHub Actions, Gitflow, Terraform, and Terraform Cloud.
  • Ability to grasp new concepts fast.

We can consider someone junior, but you really should have at least academic experience with the technologies mentioned above.

What you’ll get:

  • Pleasant atmosphere for personal and professional growth
  • Good salary and flexible hours
  • Employee Stock Option Program
  • Fun at work combined with a responsible attitude

Visit us to learn more!

What’s next for Cyber Whale

Next

Many of you may wonder: what is next in the field of IT? By 2026, more than 80% of organizations will actively use GenAI models, GenAI APIs, and/or GenAI-enabled applications, compared to less than 5% at the beginning of 2023. Organizational structures for AI management are also emerging rapidly, as discussed below.

According to Gartner, the key IT trends for 2024 and beyond are:

Democratized Generative AI:

Democratized Generative AI aims to make AI technology accessible to a wider range of users, including non-specialists and individuals without extensive technical knowledge. 

AI Trust, Risk, and Security Management (AI TRiSM):

AI TRiSM is a market segment for AI management products and services, including AI audit and monitoring tools, as well as management frameworks.

By 2026, organizations using AI TRiSM tools are forecasted to increase the accuracy of decision-making by filtering out up to 80% of irrelevant information.

Continuous Threat Exposure Management (CTEM):

CTEM is a cybersecurity process that uses attack simulations to identify and mitigate threats to an organization’s networks and systems.

By 2026, the widespread adoption of CTEM could improve enterprise cybersecurity levels by threefold.

Sustainable Technology:

Sustainable technologies are innovations that consider natural resources and contribute to economic and social development, aiming to significantly reduce environmental risks.

In the coming years, reliance on sustainable technologies is expected to increase, and IT directors’ compensation will increasingly depend on their readiness to adopt these technologies.

Platform Engineering:

Platform engineering is a technology approach that accelerates application delivery and enhances business value by providing infrastructure service automation.

AI-Augmented Development:

AI-Augmented Development refers to a set of tools and platforms for developing applications using AI, enabling developers to create applications more efficiently, quickly, and reliably.

Industry Cloud Platforms:

These platforms achieve specific industry business outcomes by integrating existing SaaS, PaaS, and IaaS services into a comprehensive offering with composable capabilities.

By 2027, the adoption of Industry Cloud Platforms within organizations is forecast to increase fivefold compared to 2023.

Intelligent Applications and Augmented Connected Workforce:

Intelligent applications accelerate and automate work processes, sometimes replacing low-skilled or insufficiently skilled workers.

By the end of 2028, 25% of IT directors are expected to use augmented connected workforce (ACWF) strategies, accelerating the competency growth of their subordinates by 50%.

Machine Customers:

Machine customers refer to machines that replace real human customers to perform tasks such as automated ordering or purchasing.

By 2030, the forecast predicts significant growth in this industry, potentially surpassing the revenue of digital commerce.

Among other trends:

  • Augmented Reality (AR) technologies are expected to experience a breakthrough starting in 2025.
  • Continued development and integration of metaverses, including the use of headsets and augmented reality.
  • Continued growth in SaaS with potential breakthroughs.
  • Development and support of LLM models.
  • Advancements in Quantum Computing, although practical, efficient quantum computing is still quite distant.
  • Internet of Things (IoT) – communication between multiple devices for coordinated operation without human intervention.
  • Remote learning (EdTech).
  • Control and cybersecurity of Big Data.
  • Cross-platform UI, Compose Multiplatform.
  • Continued development of native technologies – Swift, Kotlin, Aurora OS, ROSA.
  • Neurointerfaces requiring AI development.
  • Cloud technologies – continued expansion and demand for cloud computing specialists, data analysts, and cloud engineers.
  • Digital marketing – focus on SEO, transparency, and influencer marketing, considering Google’s Privacy Sandbox and cookie abandonment.
  • Product managers may become more popular due to future advertising restrictions, requiring products to immediately attract attention.
  • Growth and complexity of AI in smart homes, autopilots, and drones, increasing demand for data engineers, ML, AI.
  • Development of automated hiring systems in HR.
  • Growing popularity of DevOps for accelerated development processes.
  • Emergence of the prompt engineer profession.
  • Data communicator and storyteller – a subset of data analytics that may become popular, translating and presenting data in easily understandable packages.
  • UX and UI designers, especially with the rise of low-code, will continue to be popular, making software intuitive, organic, and easily manageable.

Visit us and Learn more about our projects in the next years!

 

Code of conduct at Cyber Whale


A. Basic Rules of Work Ethics

  1. To work at Cyber Whale, it is essential to comply with the necessary norms when dealing with colleagues and clients. Diligent, timely, and clear adherence to client and tech lead preferences ensures prompt results with minimal revisions.
  2. Crucial strategies and decisions essential for the company’s operation (in technical, ethical, financial, and organizational terms) are not discussed with clients without notifying and involving the managers.
  3. Every employee in our company can be confident that they will be evaluated solely based on their professional qualities. We stand against discrimination on any grounds and appreciate the individuality, personal stance, and cultural characteristics of each colleague. In case of any observed discrimination within the team, we take immediate measures to protect the rights of the colleague facing discrimination.

B. Confidentiality, Privacy, and Transparency

  1. The company’s policy emphasizes complete transparency and honest feedback with our clients, as well as the clients themselves and employees engaged in relevant projects. At the same time, we highly respect the confidentiality of our colleagues and guarantee that no personal data of colleagues, except those necessary for work activities, will leave the company. You can fully trust both our managers and the clients you work with.
  2. All work-related data handled by company employees is confidential, and all personal data of the employees themselves is private and is not to be disclosed to third parties, except in the case of intra-corporate interactions within the scope of the contract or special legal proceedings. Managers and clients, on their part, are also obligated to adhere to this directive.
  3. Adhere to digital security. When using the internet from a work computer, ensure the safety of corporate data you are working with, whether they are on your computer or directly accessible online through various accounts.
  4. We guarantee transparency in using artificial intelligence technologies in carrying out work tasks. The client must be informed that, in performing tasks such as content generation, coding, or management, we employ AI assistance.
  5. Whenever we collect others’ data, record audio/video materials with colleagues or clients, we always seek the person’s permission. Anything otherwise goes against the values of our company and our clients.

C. Organization of Working Time: General Provisions

  1. We provide a flexible work schedule, allowing the choice of working location (office or remote) and working hours from 10:00 to 19:00, with a one-hour lunch break. Short breaks for rest during working hours are allowed, and a slight adjustment to the boundaries of the working day is permissible.
  2. Communication among colleagues is welcome, but during working hours, focus should be solely on work-related topics, ensuring that a colleague can allocate time to you either immediately or later. By agreement, work-related issues can be discussed until 8 p.m., while other matters are better addressed before 10 a.m. The exception is high urgency, emergency situations, acute health deterioration, etc. Work-related issues are not discussed on weekends (except for compensatory time off or part-time work).
  3. Before taking leave, it is necessary to inform the department head and HR at least 2 weeks in advance, and before resignation, one month in advance. In this case, relevant applications (in 2 copies) should be prepared and signed by the department head or director after submission. Application templates can be obtained from HR.
  4. It is better to submit an application for sick leave than to jeopardize the project and the client with slow and poor-quality work.
  5. Before taking short-notice time off, notify the department head and HR in advance and compensate for the missed hours on the nearest working day.
  6. For us, the balance between work and life matters. We do not force our colleagues to live for work, spending more time on it than the regulated hours or tackling unmanageable tasks. We do not obstruct their desire to take a vacation or sick leave. Regular extracurricular events are held to help employees feel the company’s care, relax, enjoy good vibes, and interact with colleagues. We support employees’ desire to appreciate the results of their work at Cyber Whale in both work and non-working hours.
  7. When sending any application to the department head, also notify HR and PM, including placing them in copy when sending an email or message via messenger.
  8. For the most effective project coordination, if you live in the city where the company’s headquarters is located, it is recommended to work at the company’s office regularly, at least once a week. In other cases, rely on the goal-setting of the department head.

D. Organization of Working Time: Daily Provisions

  1. It is important to value each other’s time. Approach colleagues if you are sure that the information you provide will be informative, acceptable, unintrusive, and timely. Strive to structure thoughts clearly and concisely.
  2. Respect for time is one of the reasons why we actively use information search in browsers and with the help of AI. Practice shows that this is an effective strategy that significantly reduces micromanagement, saves managers’ and tech leads’ time, and positively influences colleagues’ ability to ask the right questions and efficiently find the necessary information. It is better to approach the tech lead or manager with well-clarified information and ensure that there are no remaining questions. These questions, compiled in a list, are discussed in subsequent calls or video conferences, after which colleagues return to improving previous tasks or completing new ones.
  3. Project management primarily relies on voice and video communication with the project group or individual colleagues. This allows for clearer conveyance of all project and task nuances, more precise regulation of work, improved coordination, and better time management, eliminating downtime due to lengthy and disorganized text-based discussions.

PickOnePic

Privacy Policy

The PickOnePic app was built as a free app. This SERVICE is provided at no cost and is intended for use as is.

This page is used to inform visitors regarding my policies with the collection, use, and disclosure of Personal Information if anyone decided to use my Service.

If you choose to use my Service, then you agree to the collection and use of information in relation to this policy. The Personal Information that I collect is used for providing and improving the Service. I will not use or share your information with anyone except as described in this Privacy Policy.

The terms used in this Privacy Policy have the same meanings as in our Terms and Conditions, which is accessible at PickOnePic unless otherwise defined in this Privacy Policy.

Information Collection and Use

For a better experience, while using our Service, I may require you to provide us with certain personally identifiable information. The information that I request will be retained on your device and is not collected by me in any way.

The app does use third party services that may collect information used to identify you.

Link to privacy policy of third party service providers used by the app

Log Data

I want to inform you that whenever you use my Service, in a case of an error in the app I collect data and information (through third party products) on your phone called Log Data. This Log Data may include information such as your device Internet Protocol (“IP”) address, device name, operating system version, the configuration of the app when utilizing my Service, the time and date of your use of the Service, and other statistics.

Cookies

Cookies are files with a small amount of data that are commonly used as anonymous unique identifiers. These are sent to your browser from the websites that you visit and are stored on your device’s internal memory.

This Service does not use these “cookies” explicitly. However, the app may use third party code and libraries that use “cookies” to collect information and improve their services. You have the option to either accept or refuse these cookies and know when a cookie is being sent to your device. If you choose to refuse our cookies, you may not be able to use some portions of this Service.

Service Providers

I may employ third-party companies and individuals due to the following reasons:

  • To facilitate our Service;
  • To provide the Service on our behalf;
  • To perform Service-related services; or
  • To assist us in analyzing how our Service is used.

I want to inform users of this Service that these third parties have access to your Personal Information. The reason is to perform the tasks assigned to them on our behalf. However, they are obligated not to disclose or use the information for any other purpose.

Security

I value your trust in providing us your Personal Information, thus we are striving to use commercially acceptable means of protecting it. But remember that no method of transmission over the internet, or method of electronic storage is 100% secure and reliable, and I cannot guarantee its absolute security.

Links to Other Sites

This Service may contain links to other sites. If you click on a third-party link, you will be directed to that site. Note that these external sites are not operated by me. Therefore, I strongly advise you to review the Privacy Policy of these websites. I have no control over and assume no responsibility for the content, privacy policies, or practices of any third-party sites or services.

Children’s Privacy

These Services do not address anyone under the age of 13. I do not knowingly collect personally identifiable information from children under 13. In case I discover that a child under 13 has provided me with personal information, I immediately delete it from our servers. If you are a parent or guardian and you are aware that your child has provided us with personal information, please contact me so that I can take the necessary actions.

Changes to This Privacy Policy

I may update our Privacy Policy from time to time. Thus, you are advised to review this page periodically for any changes. I will notify you of any changes by posting the new Privacy Policy on this page. These changes are effective immediately after they are posted on this page.

Contact Us

If you have any questions or suggestions about my Privacy Policy, do not hesitate to contact me at [email protected].

Cyber Whale hits IT Park in Republic of Moldova

SaaS
We are proud to announce that Cyber Whale LLC has been incorporated in the Republic of Moldova and has entered the IT Park.

The benefits of being present in the IT Park are the following:

  • A unified tax rate of just 7% of sales (VAT not included).
  • A straightforward procedure to become a member of the park.
  • Simplified reporting (just one monthly tax report instead of four).
  • A great opportunity for investors.
  • 0% salary tax, 0% medical tax, and 0% social tax for employees – all included in the 7% tax rate.

Cyber Whale is a digital agency rendering Digital and Creative services as well as Machine Learning and Business Intelligence services, operating worldwide from the Republic of Moldova.

How to write trained Word2Vec model to CSV with DeepLearning4j

I used DeepLearning4j to train a word2vec model. Then I had to save the dictionary to CSV so I could run some clustering algorithms on it.

It sounded like a simple task, but it took a while; here is the code that does it:

 

private void writeIndexToCsv(String csvFileName, Word2Vec model) {

    // CSVWriter is assumed to come from the opencsv library used in the original post
    CSVWriter writer = null;
    try {
        writer = new CSVWriter(new FileWriter(csvFileName));
    } catch (IOException e) {
        e.printStackTrace();
        return; // nothing to write to if the file could not be opened
    }

    // Grab the vocabulary the model was trained on
    VocabCache<VocabWord> vocabCache = model.vocab();
    Collection<VocabWord> words = vocabCache.vocabWords();

    // One CSV row per word: the word itself followed by its vector components
    for (VocabWord w : words) {
        String word = w.getWord();
        System.out.println("Looking into the word:");
        System.out.println(word);

        StringBuilder sb = new StringBuilder();
        sb.append(word).append(",");
        double[] wordVector = model.getWordVector(word);
        for (int i = 0; i < wordVector.length; i++) {
            sb.append(wordVector[i]).append(",");
        }

        // Split the accumulated row back into columns; 'false' disables quoting
        writer.writeNext(sb.toString().split(","), false);
    }

    try {
        writer.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Xanda BI Toolkit: clustering

In the previous post we introduced the open-source release of the toolkit and the general idea behind the project; now I would like to share the clustering implementation.

At this point we have implemented three clustering algorithms:

  • K-means
  • DBSCAN
  • Hierarchical clustering

K-means

A very straightforward algorithm:

# Clustering algorithms
class KMeansAlgorithm(Step):
    def __init__(self):
        # Hyperparameters and the name of the output column come from the shared settings
        self.params = settings["clustering_settings"]["kmeans_params"]
        self.newColumn = settings["clustering_settings"]["target_column"]

    def execute(self, df):
        pprint(self.__class__.__name__)
        pprint(inspect.stack()[0][3])

        # Fit K-means and attach each row's cluster label as a new column
        km = KMeans(**self.params)
        km.fit(df)
        clusters = km.labels_.tolist()
        df[self.newColumn] = clusters
        pprint(df.head(settings["rows_to_debug"]))
        return df

K-means is memory-friendly and provides good output results.

DBSCAN

Although DBSCAN is a density-based algorithm built around separating noise, it is capable of organizing clusters on its own, without a preset number of clusters.

class DBScanAlgorithm(Step):
    def __init__(self):
        self.params = settings["clustering_settings"]["dbscan_params"]
        self.newColumn = settings["clustering_settings"]["target_column"]

    def execute(self, df):
        pprint(self.__class__.__name__)
        pprint(inspect.stack()[0][3])

        # DBSCAN is distance-based, so standardize the features first
        scaled = StandardScaler().fit_transform(df)
        db = DBSCAN(**self.params).fit(scaled)
        core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
        core_samples_mask[db.core_sample_indices_] = True
        clusters = db.labels_.tolist()
        print(clusters)

        # Attach the labels to the original DataFrame; fit_transform returns a plain
        # NumPy array, which would not accept a named column assignment
        df[self.newColumn] = clusters
        pprint(df.head(settings["rows_to_debug"]))

        return df