rdga_4k

Random Data Generator Algorithm for Clustering

The rdga_4k library generates synthetic datasets for developing and testing clustering algorithms. It provides two core functions: catbird, which generates datasets with binary features, and canard, which generates datasets with categorical features. Both are configurable through noise, intersection, and seed parameters.

🔥 Features

Synthetic Data for Clustering: Tailored datasets for clustering algorithm research and testing.
Flexible Configurations: Supports binary and categorical feature generation.
Noise and Intersection Control: Fine-tune feature noise and cluster intersections.
Reproducible Results: Pass a random seed (random_state) to get consistent datasets across runs.

🛠 Installation

Install directly from GitHub with pip (git must be installed):

pip install git+https://github.com/aquinordg/rdga_4k.git

🚀 Usage

Import the library and use the catbird or canard functions to generate datasets:

from rdga_4k import catbird, canard

# Example using catbird
X, y = catbird(
    n_feat=10,
    feat_sig=[3, 2],
    rate=[50, 50],
    lmbd=0.7,
    eps=0.1,
    random_state=42
)

# Example using canard
X, y = canard(
    n_feat=10,
    n_cat=3,
    rate=[50, 50],
    lmbd=5,
    eps=0.2,
    random_state=42
)

📜 Functions Overview

`catbird`

Generates a labeled dataset with binary features based on feature clustering.

Parameters

n_feat (int): Total number of features. Must be greater than 1.
feat_sig (list[int]): Number of significant features for each cluster, one entry per cluster.
rate (list[int]): Number of examples for each cluster, one entry per cluster.
lmbd (float): Intersection factor between features. Default is 0.8.
eps (float): Noise rate for feature generation. Default is 0.2.
random_state (int or RandomState, optional): Seed for reproducibility.

Returns

X (np.ndarray): Binary matrix representing the features.
y (np.ndarray): Array of cluster labels.

Example

X, y = catbird(n_feat=10, feat_sig=[3, 2], rate=[50, 50], lmbd=0.7, eps=0.1, random_state=42)
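
A quick sanity check on the returned data can catch configuration mistakes early. The sketch below is a hypothetical helper, not part of rdga_4k. It assumes, based on the parameter descriptions above, that X has sum(rate) rows and n_feat columns, that binary means every entry is 0 or 1, and that each cluster's label appears as many times as its entry in rate. The demo data is a hand-written placeholder.

```python
from collections import Counter


def check_binary_dataset(X, y, n_feat, rate):
    """Return a list of problems found; an empty list means all checks passed."""
    rows = X.tolist() if hasattr(X, "tolist") else [list(r) for r in X]
    labels = y.tolist() if hasattr(y, "tolist") else list(y)
    problems = []
    if len(rows) != sum(rate):
        problems.append(f"expected {sum(rate)} rows, got {len(rows)}")
    if len(labels) != len(rows):
        problems.append("X and y have different lengths")
    if any(len(r) != n_feat for r in rows):
        problems.append(f"some rows do not have {n_feat} features")
    if any(v not in (0, 1) for r in rows for v in r):
        problems.append("non-binary value found")
    # Assumption: cluster sizes match `rate`, in any label order.
    if sorted(Counter(labels).values()) != sorted(rate):
        problems.append("cluster sizes do not match rate")
    return problems


# Placeholder data standing in for the output of catbird.
X_demo = [[1, 0, 1], [0, 0, 1], [1, 1, 0]]
y_demo = [0, 0, 1]
issues = check_binary_dataset(X_demo, y_demo, n_feat=3, rate=[2, 1])
print(issues)
```

The helper accepts either NumPy arrays or nested lists, so it could be applied directly to the X and y returned by catbird if those assumptions hold.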

`canard`

Generates a labeled dataset with categorical features divided into multiple categories.

Parameters

n_feat (int): Total number of features. Must be greater than 1.
n_cat (int): Number of categories for each feature. Must be greater than 1.
rate (list[int]): Number of examples per cluster.
lmbd (int): Intersection factor between features. Default is 10.
eps (float): Noise rate for feature generation. Default is 0.3.
random_state (int or RandomState, optional): Seed for reproducibility.

Returns

X (np.ndarray): Matrix of categorical features.
y (np.ndarray): Array of cluster labels.

Example

X, y = canard(n_feat=10, n_cat=3, rate=[50, 50], lmbd=5, eps=0.2, random_state=42)
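
To see how n_cat relates to the generated data, it can help to count the distinct values in each feature column. The sketch below is a hypothetical helper, not part of rdga_4k, and it assumes that each column of X holds at most n_cat distinct category values. The demo matrix is a hand-written placeholder with n_cat=3.

```python
def distinct_values_per_feature(X):
    """Count the distinct values in each column of a 2-D array or nested list."""
    rows = X.tolist() if hasattr(X, "tolist") else [list(r) for r in X]
    # zip(*rows) transposes the matrix so each tuple is one feature column.
    return [len(set(col)) for col in zip(*rows)]


# Placeholder data standing in for the output of canard with n_cat=3.
X_demo = [[0, 2, 1], [1, 2, 0], [2, 2, 1]]
counts = distinct_values_per_feature(X_demo)
within_limit = all(c <= 3 for c in counts)
print(counts, within_limit)
```

A column with fewer distinct values than n_cat is not necessarily an error, since a small sample may simply not contain every category.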

📄 License

This project is licensed under the MIT License.

🤝 Contributing

Contributions are welcome! To contribute:

Fork the repository.
Create a new branch.
Commit your changes.
Push to the branch.
Open a pull request.

For questions or further information, contact aquinordga@gmail.com.

👨‍💻 Author

Developed by AQUINO, R. D. G.

💬 Feedback

Feel free to open an issue or contact me for feedback or feature requests. Your input is highly appreciated!