API Reference
cholla_chem
cholla_chem initialization.
CIRpyNameResolver
Bases: ChemicalNameResolver
Resolver using the Chemical Identifier Resolver (CIR) via CIRpy.
Source code in cholla_chem/main.py
name_to_smiles(compound_name_list)
Convert chemical names to SMILES using CIRpy.
Source code in cholla_chem/main.py
ChemNameCorrector
Main class for correcting OCR errors in chemical names.
This class orchestrates the correction process by:
1. Applying configured correction strategies
2. Generating candidate corrections
3. Scoring candidates
4. Optionally validating with external tools
5. Returning ranked results
Example

    >>> corrector = ChemNameCorrector()
    >>> results = corrector.correct("2-ch1oropropanoic acid")
    >>> print(results[0].name)
    2-chloropropanoic acid

With custom configuration

    >>> config = CorrectorConfig(max_candidates=50)
    >>> corrector = ChemNameCorrector(config)

With external validation

    >>> validator = PubChemValidator()
    >>> results = corrector.correct("asprin", validator=validator)
Attributes:
| Name | Type | Description |
|---|---|---|
| config | | Configuration for the corrector |
| strategies | | List of active correction strategies |
| scorer | | Scoring instance for ranking candidates |
Source code in cholla_chem/name_manipulation/name_correction/name_corrector.py
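The five steps listed for ChemNameCorrector can be illustrated with a toy, self-contained sketch. None of this is cholla_chem code: `OCR_SUBSTITUTIONS`, `KNOWN_NAMES`, `Candidate`, and `toy_correct` are hypothetical names, and the scoring rule (fewer edits score higher, plus a bonus for names found in a small lookup set) is an assumption chosen only to show the generate, score, validate, rank flow.

```python
from dataclasses import dataclass
from itertools import combinations, product
from typing import Dict, List, Set

# Hypothetical OCR confusions: each key may be a misread of a listed replacement.
OCR_SUBSTITUTIONS: Dict[str, List[str]] = {"1": ["l"], "0": ["o"], "5": ["s"]}

# Hypothetical stand-in for an external validator (a fixed set of known names).
KNOWN_NAMES: Set[str] = {"2-chloropropanoic acid"}


@dataclass
class Candidate:
    name: str
    edits: int
    score: float = 0.0
    validated: bool = False


def toy_correct(name: str, max_edits: int = 2) -> List[Candidate]:
    """Generate, score, validate and rank candidate corrections (toy version)."""
    positions = [i for i, ch in enumerate(name) if ch in OCR_SUBSTITUTIONS]
    seen: Dict[str, int] = {name: 0}  # candidate name -> number of edits
    # Steps 1-2: apply the substitution strategy at up to max_edits positions.
    for k in range(1, max_edits + 1):
        for combo in combinations(positions, k):
            options = [OCR_SUBSTITUTIONS[name[i]] for i in combo]
            for replacement in product(*options):
                chars = list(name)
                for i, new_char in zip(combo, replacement):
                    chars[i] = new_char
                seen.setdefault("".join(chars), k)
    candidates = [Candidate(n, e) for n, e in seen.items()]
    # Steps 3-4: score each candidate and validate it against the lookup set.
    for cand in candidates:
        cand.validated = cand.name in KNOWN_NAMES
        cand.score = (1.0 if cand.validated else 0.0) + 1.0 / (1 + cand.edits)
    # Step 5: return candidates ranked by descending score.
    return sorted(candidates, key=lambda c: c.score, reverse=True)


results = toy_correct("2-ch1oropropanoic acid")
print(results[0].name)
```

In this sketch the unedited input is kept as a candidate, so a name that is already correct can still rank first when it passes validation.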
__init__(config=None, strategies=None)
Initialize the chemical name corrector.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | Optional[CorrectorConfig] | Configuration object (uses defaults if None) | None |
| strategies | Optional[List[CorrectionStrategy]] | Custom list of strategies (uses defaults if None) | None |
Source code in cholla_chem/name_manipulation/name_correction/name_corrector.py
add_strategy(strategy)
Add a custom correction strategy.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| strategy | CorrectionStrategy | The strategy to add | required |
Source code in cholla_chem/name_manipulation/name_correction/name_corrector.py
correct(name, use_validator=True, validate_all=False)
Correct a chemical name and return ranked candidates.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The chemical name to correct | required |
| use_validator | bool | Whether to use external validator | True |
| validate_all | bool | Whether to validate all candidates or just the top ones | False |

Returns:
| Type | Description |
|---|---|
| List[CorrectionCandidate] | List of CorrectionCandidate objects, sorted by score (descending) |
Source code in cholla_chem/name_manipulation/name_correction/name_corrector.py
correct_batch(names, use_validator=True, validate_all=False)
Correct multiple chemical names.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| names | List[str] | List of chemical names to correct | required |
| use_validator | bool | Whether to use external validator | True |
| validate_all | bool | Whether to validate all candidates or just the top ones | False |

Returns:
| Type | Description |
|---|---|
| Dict[str, List[CorrectionCandidate]] | Dictionary mapping original names to their candidates |
Source code in cholla_chem/name_manipulation/name_correction/name_corrector.py
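The shape of the batch result can be sketched without the library. The helper names `toy_correct_one` and `toy_correct_batch` are hypothetical, and the single-character fix stands in for the real correction logic. Because the result is a dictionary keyed by the original name, repeated input names collapse into one entry.

```python
from typing import Callable, Dict, List


def toy_correct_one(name: str) -> List[str]:
    """Hypothetical single-name corrector: swap the digit 1 for the letter l."""
    return [name.replace("1", "l")]


def toy_correct_batch(
    names: List[str], correct_one: Callable[[str], List[str]]
) -> Dict[str, List[str]]:
    """Map each original name to its ranked candidate list."""
    return {name: correct_one(name) for name in names}


batch = toy_correct_batch(["ch1oroform", "ethano1", "ch1oroform"], toy_correct_one)

# Read the top candidate per name, guarding against an empty candidate list.
top = {name: (cands[0] if cands else None) for name, cands in batch.items()}
print(top)
```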
explain_corrections(candidate)
Generate a human-readable explanation of corrections.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| candidate | CorrectionCandidate | The candidate to explain | required |

Returns:
| Type | Description |
|---|---|
| str | Multi-line string explaining all corrections |
Source code in cholla_chem/name_manipulation/name_correction/name_corrector.py
get_best_candidate(name, use_validator=True)
Get the single best correction candidate.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Chemical name to correct | required |
| use_validator | bool | Whether to use external validator | True |

Returns:
| Type | Description |
|---|---|
| Optional[CorrectionCandidate] | Best candidate, or None if no candidates found |
Source code in cholla_chem/name_manipulation/name_correction/name_corrector.py
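Because the documented return type is Optional, callers need to handle None before touching the result. A minimal sketch of that pattern follows; `best_or_none` is a hypothetical helper operating on plain strings, not the library method.

```python
from typing import List, Optional, TypeVar

T = TypeVar("T")


def best_or_none(ranked: List[T]) -> Optional[T]:
    """Return the first item of an already-ranked list, or None if it is empty."""
    return ranked[0] if ranked else None


best = best_or_none(["2-chloropropanoic acid", "2-ch1oropropanoic acid"])
missing = best_or_none([])

# Guard against None before using the result.
label = missing if missing is not None else "<no candidate>"
print(best, label)
```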
remove_strategy(strategy_name)
Remove a strategy by name.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| strategy_name | str | Name of the strategy to remove | required |

Returns:
| Type | Description |
|---|---|
| bool | True if strategy was found and removed, False otherwise |
Source code in cholla_chem/name_manipulation/name_correction/name_corrector.py
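The add/remove contract described for strategies (append a strategy, remove by name, report whether anything was removed) can be sketched as below. `ToyStrategy` and `ToyStrategyRegistry` are hypothetical stand-ins; in particular, the assumption that a strategy exposes a `name` attribute is inferred from the `strategy_name` parameter, not confirmed by the source.

```python
from typing import Callable, List


class ToyStrategy:
    """Hypothetical correction strategy: a name plus a candidate-generating function."""

    def __init__(self, name: str, apply: Callable[[str], List[str]]) -> None:
        self.name = name
        self.apply = apply


class ToyStrategyRegistry:
    """Holds an ordered list of strategies, mirroring the documented add/remove calls."""

    def __init__(self) -> None:
        self.strategies: List[ToyStrategy] = []

    def add_strategy(self, strategy: ToyStrategy) -> None:
        self.strategies.append(strategy)

    def remove_strategy(self, strategy_name: str) -> bool:
        # Remove the first strategy whose name matches; report whether one was found.
        for index, strategy in enumerate(self.strategies):
            if strategy.name == strategy_name:
                del self.strategies[index]
                return True
        return False


registry = ToyStrategyRegistry()
registry.add_strategy(ToyStrategy("strip_spaces", lambda n: [n.replace(" ", "")]))
removed_first = registry.remove_strategy("strip_spaces")
removed_again = registry.remove_strategy("strip_spaces")
print(removed_first, removed_again)
```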
ChemSpiPyResolver
Bases: ChemicalNameResolver
Resolver using ChemSpiPy.
Source code in cholla_chem/main.py
name_to_smiles(compound_name_list)
Convert chemical names to SMILES using ChemSpiPy.
Source code in cholla_chem/main.py
ChemicalNameResolver
Bases: ABC
Abstract base class for chemical name-to-SMILES resolvers.
Subclasses must implement the name_to_smiles method.
Source code in cholla_chem/main.py
rate_limit_time
property
Return rate_limit_time.
requires_internet
property
Return requires_internet.
resolver_name
property
Return resolver_name.
resolver_weight
property
Return resolver_weight.
name_to_smiles(compound_name_list)
abstractmethod
Convert chemical names to SMILES strings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| compound_name_list | List[str] | List of chemical names. | required |

Returns:
| Type | Description |
|---|---|
| Tuple[Dict[str, str], Dict[str, str]] | Tuple of (1) a dict mapping successfully resolved names to SMILES and (2) a dict mapping failed names to error messages. |
Source code in cholla_chem/main.py
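The name_to_smiles contract (one dictionary of successes, one of failures) can be sketched with a stand-in base class. `ToyResolverBase` and `LookupResolver` are hypothetical; the real ChemicalNameResolver also exposes properties such as resolver_name and resolver_weight, and its constructor arguments are not shown in this reference, so they are left out here.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class ToyResolverBase(ABC):
    """Hypothetical stand-in mirroring the documented abstract method."""

    @abstractmethod
    def name_to_smiles(
        self, compound_name_list: List[str]
    ) -> Tuple[Dict[str, str], Dict[str, str]]:
        """Return (resolved names -> SMILES, failed names -> error message)."""


class LookupResolver(ToyResolverBase):
    """Resolves names from an in-memory table, case-insensitively."""

    def __init__(self, table: Dict[str, str]) -> None:
        self._table = {key.lower(): value for key, value in table.items()}

    def name_to_smiles(
        self, compound_name_list: List[str]
    ) -> Tuple[Dict[str, str], Dict[str, str]]:
        resolved: Dict[str, str] = {}
        failed: Dict[str, str] = {}
        for name in compound_name_list:
            smiles = self._table.get(name.lower())
            if smiles is None:
                failed[name] = "name not found in lookup table"
            else:
                resolved[name] = smiles
        return resolved, failed


resolver = LookupResolver({"ethanol": "CCO", "benzene": "c1ccccc1"})
resolved, failed = resolver.name_to_smiles(["Ethanol", "unobtainium"])
print(resolved, failed)
```

Keeping failures in a separate dictionary, keyed by the name as the caller wrote it, lets a caller retry only the failed names with another resolver.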
CorrectorConfig
dataclass
Configuration for the ChemNameCorrector.
Attributes:
| Name | Type | Description |
|---|---|---|
| max_candidates | int | Maximum number of candidates to generate |
| max_corrections_per_candidate | int | Maximum corrections per candidate |
| min_score_threshold | float | Minimum score to include candidate in results |
| enable_character_substitution | bool | Enable OCR character correction |
| max_character_substitution_edits | int | Max number of substitution edits |
| enable_punctuation_restoration | bool | Enable missing punctuation detection |
| enable_bracket_balancing | bool | Enable bracket matching correction |
| custom_substitutions | Dict[str, List[str]] | Additional user-defined substitution rules |
| custom_rules | List[CorrectionRule] | Additional user-defined correction rules |
| enable_external_validation | bool | Enable external validation of candidates |
Source code in cholla_chem/name_manipulation/name_correction/dataclasses.py
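As an illustration of how a dataclass configuration of this kind is typically built and varied, here is a stand-in with a subset of the documented field names. `ToyCorrectorConfig` is hypothetical and every default value below is an assumption; the real defaults are not listed in this reference.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List


@dataclass
class ToyCorrectorConfig:
    """Hypothetical stand-in; field names follow the table, defaults are assumed."""

    max_candidates: int = 100
    min_score_threshold: float = 0.0
    enable_character_substitution: bool = True
    # Mutable defaults need default_factory so instances do not share one dict.
    custom_substitutions: Dict[str, List[str]] = field(default_factory=dict)


base = ToyCorrectorConfig()

# dataclasses.replace copies the config with selected fields overridden.
strict = replace(base, max_candidates=50, custom_substitutions={"rn": ["m"]})
print(base.max_candidates, strict.max_candidates)
```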
InorganicShorthandNameResolver
Bases: ChemicalNameResolver
Resolver using inorganic shorthand (e.g. [Cp*RhCl2]2).
Source code in cholla_chem/main.py
name_to_smiles(compound_name_list)
Convert chemical names to SMILES using inorganic shorthand converter.
Source code in cholla_chem/main.py
ManualNameResolver
Bases: ChemicalNameResolver
Resolver using manually curated names and corresponding SMILES.
Source code in cholla_chem/main.py
name_to_smiles(compound_name_list, provided_name_dict=None)
Convert chemical names to SMILES using manual name database.
Source code in cholla_chem/main.py
OpsinNameResolver
Bases: ChemicalNameResolver
Resolver using OPSIN via py2opsin.
Source code in cholla_chem/main.py
name_to_smiles(compound_name_list)
Convert chemical names to SMILES using OPSIN.
Source code in cholla_chem/main.py
PubChemNameResolver
Bases: ChemicalNameResolver
Resolver using PubChem via PubChemPy.
Source code in cholla_chem/main.py
name_to_smiles(compound_name_list)
Convert chemical names to SMILES using PubChem.
Source code in cholla_chem/main.py
StructuralFormulaNameResolver
Bases: ChemicalNameResolver
Resolver using structural chemical formula (e.g. CH3CH2CH2COOH).
Source code in cholla_chem/main.py
name_to_smiles(compound_name_list)
Convert chemical names to SMILES using structural formula converter.
Source code in cholla_chem/main.py
resolve_compounds_to_smiles(compounds_list, resolvers_list=[], smiles_selection_mode='weighted', detailed_name_dict=False, batch_size=500, normalize_unicode=True, split_names_to_solve=True, resolve_peptide_shorthand=True, attempt_name_correction=True, internet_connection_available=True, name_correction_config=None)
Resolve a list of compound names to their SMILES representations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| compounds_list | List[str] | A list of compound names. | required |
| resolvers_list | List[ChemicalNameResolver] | A list of ChemicalNameResolver instances. Defaults to []. | [] |
| smiles_selection_mode | str | The method to select the SMILES representation from multiple resolvers. Defaults to 'weighted'. | 'weighted' |
| detailed_name_dict | bool | If True, returns a dictionary with detailed information about each compound. Defaults to False. | False |
| batch_size | int | The number of compounds to process in each batch. Defaults to 500. | 500 |
| normalize_unicode | bool | Whether to normalize Unicode characters in compound names. Defaults to True. | True |
| split_names_to_solve | bool | Whether to split compound names on common delimiters to solve them as separate compounds. Can be used to solve otherwise unresolvable compound names such as BH3•THF. Defaults to True. | True |
| resolve_peptide_shorthand | bool | Whether to resolve peptide shorthand notation. Defaults to True. | True |
| attempt_name_correction | bool | Whether to attempt to correct compound names that are misspelled or contain typos. Defaults to True. | True |
| internet_connection_available | bool | Whether an internet connection is available to resolve compound names. Defaults to True. | True |
| name_correction_config | CorrectorConfig | Configuration for name correction. Defaults to None. | None |
Returns:
| Type | Description |
|---|---|
| Dict[str, CompoundResolutionEntry] \| Dict[str, CompoundResolutionEntryWithNameCorrection] \| Dict[str, str] | If detailed_name_dict is True, a dictionary mapping each compound to its SMILES representation and resolvers; otherwise, a simple dictionary mapping each compound to its selected SMILES representation. |
Source code in cholla_chem/main.py
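Three of the parameters above (split_names_to_solve, batch_size, and smiles_selection_mode) describe steps that can be sketched independently of the library. Everything in the block is hypothetical: the delimiter set, the helper names `split_name`, `batched`, and `select_weighted`, the resolver labels, and the weights. In particular, treating 'weighted' as a sum of per-resolver weights is an assumption suggested by the resolver_weight property, not a description of the actual selection logic.

```python
import re
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

# Hypothetical delimiter set; the delimiters actually used may differ.
DELIMITERS = re.compile(r"[•·]")


def split_name(name: str) -> List[str]:
    """Split a compound name such as 'BH3•THF' into its component names."""
    return [part.strip() for part in DELIMITERS.split(name) if part.strip()]


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def select_weighted(
    votes: List[Tuple[str, str, float]]
) -> Tuple[str, Dict[str, List[str]]]:
    """Pick the SMILES with the largest summed weight.

    votes holds (resolver_name, smiles, weight) triples. SMILES strings are
    compared verbatim here, with no canonicalization.
    """
    totals: Dict[str, float] = defaultdict(float)
    supporters: Dict[str, List[str]] = defaultdict(list)
    for resolver_name, smiles, weight in votes:
        totals[smiles] += weight
        supporters[smiles].append(resolver_name)
    best = max(totals, key=lambda s: totals[s])
    return best, dict(supporters)


parts = split_name("BH3•THF")
batches = list(batched(["a", "b", "c", "d", "e"], 2))
best, supporters = select_weighted(
    [("resolver_a", "CCO", 3.0), ("resolver_b", "CCO", 2.0), ("resolver_c", "CO", 4.0)]
)
print(parts, batches, best)
```

In this sketch two lower-weight resolvers that agree outvote a single higher-weight resolver, which is the main behavioural difference between summed weights and simply trusting the highest-weight source.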