declearn.dataset.utils.sparse_from_file

Return a scipy sparse matrix loaded from a text file.

See the sparse_to_file counterpart function to create reloadable sparse data dump files.
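
For illustration, a minimal round trip could look as follows. This is only a sketch: it assumes sparse_to_file takes the target path first and the matrix second, and the file name is arbitrary.

    from scipy.sparse import csr_matrix
    from declearn.dataset.utils import sparse_from_file, sparse_to_file

    matrix = csr_matrix([[0.0, 1.5], [2.0, 0.0]])
    # Assumed argument order: (path, matrix); check sparse_to_file's signature.
    sparse_to_file("matrix.sparse", matrix)
    restored = sparse_from_file("matrix.sparse")
    assert isinstance(restored, csr_matrix)  # the original sparse format is restored
    assert (restored.toarray() == matrix.toarray()).all()  # values match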

Parameters:

    path: str
        Path to the sparse matrix dump file. (required)

Returns:

    matrix: scipy.sparse.spmatrix
        Sparse matrix restored from the file, whose exact type is defined by that file.

Raises:

    KeyError
        If the file's header cannot be JSON-parsed or does not conform to the expected standard.
    TypeError
        If the documented sparse matrix type is not supported, i.e. not one of "bsr", "csr", "csc", "coo", "dia", "dok" or "lil".

Note

The format used is mostly similar to the SVMlight one (see for example sklearn.datasets.load_svmlight_file), but the file must store a single matrix rather than an (X, y) pair of arrays. It must also record some metadata in its header, notably used to restore the original matrix's dtype and sparse format.
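
To make the header and row layout concrete, a dump of a small 3-by-4 CSR matrix of float64 values could look like the following. This is an illustrative sketch of what the loader accepts, not necessarily the verbatim output of sparse_to_file.

    {"stype": "csr", "dtype": "float64", "shape": [3, 4]}
    0:1.0 2:3.5

    1:2.0

The first line is a JSON header carrying the sparse format ("stype"), dtype and shape; each following line lists space-separated column:value pairs for one row, and an empty line stands for an all-zero row.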

Source code in declearn/dataset/utils/_sparse.py
def sparse_from_file(path: str) -> spmatrix:
    """Return a scipy sparse matrix loaded from a text file.

    See the [`sparse_to_file`][declearn.dataset.utils.sparse_to_file]
    counterpart function to create reloadable sparse data dump files.

    Parameters
    ----------
    path: str
        Path to the sparse matrix dump file.

    Returns
    -------
    matrix: scipy.sparse.spmatrix
        Sparse matrix restored from file, the exact type
        of which is defined by said file.

    Raises
    ------
    KeyError
        If the file's header cannot be JSON-parsed or does not
        conform to the expected standard.
    TypeError
        If the documented sparse matrix type is not supported,
        i.e. not one of "bsr", "csr", "csc", "coo", "dia", "dok" or "lil".


    Note
    ----
    The format used is mostly similar to the SVMlight one (see for example
    `sklearn.datasets.load_svmlight_file`), but the file must store a single
    matrix rather than a (X, y) pair of arrays. It must also record some
    metadata in its header, which are notably used to restore the initial
    matrix's dtype and type of sparse format.
    """
    with open(path, "r", encoding="utf-8") as file:
        # Read and parse the file's header.
        try:
            head = json.loads(file.readline())
        except json.JSONDecodeError as exc:
            raise KeyError("Invalid header for sparse matrix file.") from exc
        if any(key not in head for key in ("stype", "dtype", "shape")):
            raise KeyError("Invalid header for sparse matrix file.")
        if head["stype"] not in SPARSE_TYPES.values():
            raise TypeError(f"Invalid sparse matrix type: '{head['stype']}'.")
        # Instantiate a lil_matrix abiding by the header's specs.
        lil = lil_matrix(tuple(head["shape"]), dtype=head["dtype"])
        cnv = int if lil.dtype.kind == "i" else float
        # Iteratively parse and fill-in row data.
        for rix, row in enumerate(file):
            row = row.strip(" \n")
            if not row:  # all-zeros row
                continue
            for field in row.split(" "):
                ind, val = field.split(":")
                lil[rix, int(ind)] = cnv(val)
    # Convert the matrix to its initial format and return.
    return getattr(lil, f"to{head['stype']}")()